Capacity of Random Channels with Large Alphabets

We consider discrete memoryless channels with input alphabet size $n$ and output alphabet
size $m$, where $m = \lceil \gamma n \rceil$ for some constant $\gamma > 0$. The channel transition matrix consists of
entries that, before being normalized, are independent and identically distributed nonnegative
random variables $V$ such that $\mathbb{E}[(V \log V)^2] < \infty$. We prove that in the limit as
$n \to \infty$ the capacity of such a channel converges to $\mathrm{Ent}(V)/\mathbb{E}[V]$ almost surely and in $L^2$,
where $\mathrm{Ent}(V) := \mathbb{E}[V \log V] - \mathbb{E}[V] \log \mathbb{E}[V]$ denotes the entropy of $V$. We further show
that, under slightly different model assumptions, the capacity of these random channels
converges to this asymptotic value exponentially in $n$. Finally, we present an application in the
context of Bayesian optimal experiment design.


1. INTRODUCTION 

Since Shannon’s seminal 1948 paper [1], channel capacity has become a fundamental concept
in information theory, specifying the asymptotic limit on the maximum rate at which information
can be transmitted reliably over a channel. In this work, we restrict ourselves to discrete
memoryless channels (DMCs), which comprise a finite input alphabet $\mathcal{X} = \{1, 2, \ldots, n\}$, a finite output
alphabet $\mathcal{Y} = \{1, 2, \ldots, m\}$, and a conditional probability mass function expressing the probability
of observing the output symbol $y$ given the input symbol $x$, denoted by $W_{x,y}$. Any DMC can be
represented by a stochastic matrix $W = (W_{x,y})_{x \in \mathcal{X}, y \in \mathcal{Y}} \in [0,1]^{n \times m}$, whose rows are normalized,
i.e., $\sum_{y \in \mathcal{Y}} W_{x,y} = 1$ for all $x \in \mathcal{X}$. According to Shannon [1], the channel capacity of a DMC $W$ is
given by

$$ C(W) = \max_{p \in \Delta_n} I(p, W), \tag{1} $$

where $I(p, W) := \sum_{x \in \mathcal{X}} p(x) \, D\big(W_{x,\cdot} \,\|\, (pW)(\cdot)\big)$ denotes the mutual information and
$\Delta_n := \{x \in \mathbb{R}^n : \sum_{i=1}^n x_i = 1, \ x_i \geq 0 \text{ for all } i\}$ the $n$-simplex. The channel law is described by
$W_{x,y} = \mathbb{P}[Y = y \mid X = x]$, $(pW)(\cdot)$ denotes the probability distribution of the channel output induced
by $p$ and $W$, which is given by $(pW)(y) := \sum_{x \in \mathcal{X}} p(x) W_{x,y}$ for $y \in \mathcal{Y}$, and $D(\cdot \| \cdot)$ is the relative
entropy. The optimization problem (1), while being convex, in general does not admit a closed form
solution. Therefore, channel capacities are usually approximated with numerical algorithms such
as the Blahut-Arimoto algorithm [2, 3], whose computational complexity of finding an additive
$\varepsilon$-close solution scales cubically in the alphabet size, and as such the computational cost required
for an acceptable accuracy for channels with large input alphabets can be considerable (see [4] for
more details).
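To make the definitions in (1) concrete, the following is a minimal sketch of the mutual information and of the textbook Blahut-Arimoto iteration, written in plain Python for small channels. The function names `mutual_information` and `blahut_arimoto`, the fixed iteration count and the two test channels are illustrative assumptions; this is not the accelerated method of [4].

```python
import math

def mutual_information(p, W):
    """I(p, W) in bits for an input pmf p and a row-stochastic matrix W."""
    n, m = len(W), len(W[0])
    q = [sum(p[x] * W[x][y] for x in range(n)) for y in range(m)]  # output pmf pW
    total = 0.0
    for x in range(n):
        for y in range(m):
            if p[x] > 0 and W[x][y] > 0:
                total += p[x] * W[x][y] * math.log2(W[x][y] / q[y])
    return total

def blahut_arimoto(W, iters=500):
    """Return (capacity estimate in bits, input pmf) after a fixed number of sweeps."""
    n, m = len(W), len(W[0])
    p = [1.0 / n] * n                      # start from the uniform input distribution
    for _ in range(iters):
        q = [sum(p[x] * W[x][y] for x in range(n)) for y in range(m)]
        # d[x] = D(W[x, :] || q) in nats
        d = [sum(W[x][y] * math.log(W[x][y] / q[y]) for y in range(m) if W[x][y] > 0)
             for x in range(n)]
        w = [p[x] * math.exp(d[x]) for x in range(n)]   # multiplicative update
        s = sum(w)
        p = [v / s for v in w]
    return mutual_information(p, W), p

# Binary symmetric channel with crossover 0.1 and a Z-channel with crossover 0.5.
print(blahut_arimoto([[0.9, 0.1], [0.1, 0.9]])[0])
print(blahut_arimoto([[1.0, 0.0], [0.5, 0.5]])[0])
```

Each sweep costs on the order of $nm$ operations, so the total cost is driven by the number of sweeps needed for a given accuracy, which is what makes large alphabets expensive.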

In this paper, we are interested in a particular class of DMCs characterized by the
property that each entry of the channel matrix is an i.i.d. random variable before the rows are
normalized. Two different scenarios are considered. First, we assume that each entry of the channel
transition matrix is a nonnegative i.i.d. random variable $V$ before being normalized and that
$m = \lceil \gamma n \rceil$ for some constant $\gamma > 0$. Using duality of convex optimization, we prove in Theorem 2.3
that as $n \to \infty$ the capacity of such a (random) DMC converges to $\frac{\mu_2}{\mu_1} - \log \mu_1$ almost surely
and in $L^2$, where $\mu_1 := \mathbb{E}[V] > 0$ and $\mu_2 := \mathbb{E}[V \log V]$. Second, we consider a more general
setup under slightly different model assumptions, where each entry $V_{x,y}$ of the channel transition
matrix, before being normalized, is independent and distributed on the nonnegative real line such
that for all $x \in \mathcal{X}$ and for all $y \in \mathcal{Y}$ we have
$\mu_{1,n} := \frac{1}{m} \sum_{y \in \mathcal{Y}} \mathbb{E}[V_{x,y}] = \frac{1}{n} \sum_{x \in \mathcal{X}} \mathbb{E}[V_{x,y}]$ and
$\mu_{2,n} := \frac{1}{m} \sum_{y \in \mathcal{Y}} \mathbb{E}[V_{x,y} \log V_{x,y}]$. In Theorem 5.2 we show that the capacity of such a random
DMC converges exponentially in $n$ to its asymptotic value $\lim_{n \to \infty} \frac{\mu_{2,n}}{\mu_{1,n}} - \log \mu_{1,n}$ in probability.
Therefore, for the considered class of random DMCs the capacity, as the alphabet sizes tend to
infinity, admits a closed form expression. We will show that this favourable property can be
exploited in applications in the context of Bayesian optimal experiment design.

In the literature there exists a variety of extensively studied channel models that are described 
by random constructions, where one observes that in the limit as the blocklength tends to infinity 
the capacity converges to a deterministic value. This is sometimes viewed as a manifestation of
diversity [5, 6]. A common model studied in [5, 6] is of the form y = Gx + w, where x is an
n-dimensional input vector and y represents an m-dimensional output vector. G is modeled as a 
random matrix (the simplest example is the one where G has i.i.d. entries) and w denotes additive 
noise. To the best of our knowledge the random channel model that is considered in this article 
has never been addressed directly in the literature. 

Understanding the behavior of random channels is important from a theoretical viewpoint as 
random constructions can serve as a powerful tool in order to prove statements. For example in 
quantum information theory there was a long-standing conjecture that the Holevo capacity of a 
quantum channel is additive [7]. A few years ago, Hastings showed that the conjecture is false [8], 
by constructing a random, high-dimensional channel whose Holevo capacity is not additive. Despite 
considerable effort, there is no deterministic, low-dimensional quantum channel known for which 
we can prove that the Holevo capacity is not additive. This example shows the power of random 
constructions as a proof technique. 

Notation. The logarithm with basis 2 is denoted by $\log(\cdot)$ and the natural logarithm by $\ln(\cdot)$.
We consider DMCs with an input alphabet $\mathcal{X} = \{1, 2, \ldots, n\} =: [n]$ and an output alphabet
$\mathcal{Y} = \{1, 2, \ldots, m\} =: [m]$. The channel law is summarized in a stochastic matrix $W \in \mathcal{M}_{n,m}$,
where $W_{x,y} := \mathbb{P}[Y = y \mid X = x]$ and $\mathcal{M}_{n,m}$ denotes the set of all stochastic $n \times m$ matrices. The
input and output probability mass functions are denoted by the vectors $p \in \Delta_n$ and $q \in \Delta_m$,
where we define the standard $n$-simplex as $\Delta_n := \{x \in \mathbb{R}^n : \sum_{i=1}^n x_i = 1, \ x_i \geq 0 \text{ for all } i\}$. For a
probability mass function $p \in \Delta_n$, we denote the Shannon entropy by $H(p) := -\sum_{i=1}^n p_i \log p_i$.
It is convenient to introduce an additional variable for the conditional entropy of $Y$ given $X$ as
$r \in \mathbb{R}^n$, where $r_x := -\sum_{y=1}^m W_{x,y} \log W_{x,y}$. We denote the maximum (resp. minimum) between $a$
and $b$ by $a \vee b$ (resp. $a \wedge b$) and by $\lceil \cdot \rceil$ the ceiling function. Given a nonempty set $A \subset \mathbb{R}$, its Borel
$\sigma$-algebra is denoted by $\mathcal{B}(A)$. The uniform distribution with support $A$ is denoted by $\mathcal{U}(A)$ and
the exponential distribution with rate parameter $\lambda > 0$ by $\mathcal{E}(\lambda)$. The Dirichlet distribution on the
$n$-simplex with concentration parameter $\alpha \in \mathbb{R}^n_{>0}$ is denoted by $\mathrm{Dir}(\alpha_1, \ldots, \alpha_n)$ and the lognormal
distribution with parameters $z \in \mathbb{R}$ and $\sigma > 0$ by $\ln\mathcal{N}(z, \sigma)$. The Dirac delta distribution
is denoted by $\delta(\cdot)$. By convention, when referring to sets or functions, measurable means
Borel-measurable. Let $U$ be a nonnegative real-valued integrable random variable. The entropy of $U$ is
defined as $\mathrm{Ent}(U) := \mathbb{E}[U \log U] - \mathbb{E}[U] \log \mathbb{E}[U]$.

Structure. In Section 2 the asymptotic capacity of random DMCs having the form explained
above is determined. Section 3 contains a numerical simulation of a random DMC whose rows are
uniformly distributed over the n-simplex. An application of the asymptotic capacity in terms of
optimal design of experiments is presented in Section 4. Section 5 proves the exponential rate of
convergence for the capacity of such random DMCs under slightly different model assumptions.


2. ASYMPTOTIC CAPACITY 

Consider a probability space $(\Omega, \mathcal{A}, \mathbb{P})$ and let $(V_{x,y})_{x \in [n], y \in [m]}$ be a sequence of i.i.d. nonnegative
random variables on $\Omega$. We define the channel transition matrix¹ $W^{(V,n)} := (W^{(V,n)}_{x,y})_{x \in [n], y \in [m]}$ by
$W^{(V,n)}_{x,y} := V_{x,y} / \sum_{y' \in [m]} V_{x,y'}$, which can be easily verified to be a stochastic matrix, i.e.,
$0 \leq W^{(V,n)}_{x,y} \leq 1$ for all $x \in [n]$, $y \in [m]$ and $\sum_{y \in [m]} W^{(V,n)}_{x,y} = 1$ for all $x \in [n]$. We impose the
following assumption on the random variables $V_{x,y}$.

Assumption 2.1. The random variables $V_{x,y}$ are such that $\mathbb{E}[V_{x,y}] > 0$ and $\mathbb{E}[(V_{x,y} \log V_{x,y})^2] < \infty$.

Note that Assumption 2.1 implies that $\mathbb{E}[V_{x,y}] < \infty$. The following assumption provides a
relation between the input and output alphabet size that is required for the main theorem.

Assumption 2.2. There is a positive constant $\gamma \in \mathbb{R}_{>0}$ such that the output alphabet size is
given by $m = \lceil \gamma n \rceil$.

It can be easily shown that the capacity $C(W^{(V,n)})$ of such a (random) DMC as well as the
optimal input distribution are random variables. In order to do so, note that the mapping
$(V_{x,y})_{x \in [n], y \in [m]} \mapsto (W^{(V,n)}_{x,y})_{x \in [n], y \in [m]} = \big(V_{x,y} / \sum_{y' \in [m]} V_{x,y'}\big)_{x \in [n], y \in [m]}$ constructing the channel clearly
is measurable and therefore, invoking Lemma A.1 (see Appendix A for details), the channel capacity
$C(W^{(V,n)})$ is a function from $\Omega$ to $\mathbb{R}_{\geq 0}$ that is $(\mathcal{A}, \mathcal{B}(\mathbb{R}_{\geq 0}))$-measurable and hence a random variable.
Therefore, we can state the main result as follows, where we define $\mu_1 := \mathbb{E}[V_{x,y}]$ and
$\mu_2 := \mathbb{E}[V_{x,y} \log V_{x,y}]$.

Theorem 2.3 (Asymptotic capacity). Under Assumptions 2.1 and 2.2, as $n \to \infty$ the capacity
$C(W^{(V,n)})$ converges to $\frac{\mu_2}{\mu_1} - \log \mu_1$ almost surely and in $L^2$.

Proof. See Section 2 A. □

Under weaker assumptions on the channel matrix we can prove a weaker convergence statement
for the asymptotic capacity.

Corollary 2.4 (Asymptotic capacity). Under Assumption 2.2 and $\mathbb{E}[V_{x,y}] > 0$ and $\mathbb{E}[V_{x,y} \log V_{x,y}] < \infty$,
the capacity $C(W^{(V,n)})$ of the DMC, as $n \to \infty$, converges to $\frac{\mu_2}{\mu_1} - \log \mu_1$ almost surely.

Proof. Follows directly from the proof of Theorem 2.3. □

Let us discuss some implications of Theorem 2.3 and provide a few examples. 

Remark 2.5 (Connection to $\Phi$-entropy). For any convex function $\Phi : \mathbb{R}_{\geq 0} \to \mathbb{R}$, the $\Phi$-entropy of
a nonnegative real-valued integrable random variable $U$ is defined by

$$ \mathrm{Ent}_\Phi(U) := \mathbb{E}[\Phi(U)] - \Phi(\mathbb{E}[U]), $$

see [9, Chapter 14] for a comprehensive study. Let us consider the function $\Phi(u) = u \log u$ and
denote the resulting $\Phi$-entropy by $\mathrm{Ent}(U)$, which simplifies to

$$ \mathrm{Ent}(U) = \mathbb{E}[U \log U] - \mathbb{E}[U] \log \mathbb{E}[U]. $$

Under Assumptions 2.1 and 2.2, using the $\Phi$-entropy, Theorem 2.3 can be stated equivalently as

$$ \lim_{n \to \infty} C(W^{(V,n)}) = \frac{\mathrm{Ent}(V_{x,y})}{\mathbb{E}[V_{x,y}]}. $$

¹ In Assumption 2.2 the output alphabet size is assumed to be a function of the input alphabet size and as such the
index m is suppressed in the notation of the channel matrix.


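Remark 2.6 (Properties of the asymptotic capacity). The asymptotic capacity described in
Theorem 2.3

(i) is nonnegative by Jensen’s inequality, since $\mathbb{R}_{\geq 0} \ni x \mapsto x \log x \in \mathbb{R}$ is a convex function.

(ii) can be zero. Consider random variables $V_{x,y}$ such that $\mathbb{P}[V_{x,y} = \alpha] = 1$ for some $\alpha \in \mathbb{R}_{>0}$.
This then gives $\mu_1 = \mathbb{E}[V_{x,y}] = \alpha$ and $\mu_2 = \mathbb{E}[V_{x,y} \log V_{x,y}] = \alpha \log \alpha$, which leads to
$\frac{\mu_2}{\mu_1} - \log \mu_1 = 0$.

(iii) can be arbitrarily large. Consider random variables $V_{x,y}$ such that for some $\varepsilon \in (0,1)$,
$\mathbb{P}[V_{x,y} = 0] = 1 - \varepsilon$ and $\mathbb{P}[V_{x,y} = 1] = \varepsilon$. This then gives $\mu_1 = \mathbb{E}[V_{x,y}] = \varepsilon$ and
$\mu_2 = \mathbb{E}[V_{x,y} \log V_{x,y}] = 0$ and hence $\frac{\mu_2}{\mu_1} - \log \mu_1 = \log \frac{1}{\varepsilon}$, which tends to infinity as $\varepsilon \to 0$.

(iv) admits the homogeneity property $\lim_{n \to \infty} C(W^{(\alpha V,n)}) = \lim_{n \to \infty} C(W^{(V,n)})$ for any $\alpha > 0$.
This follows by Remark 2.5, as

$$ \lim_{n \to \infty} C(W^{(\alpha V,n)}) = \frac{\mathrm{Ent}(\alpha V_{11})}{\mathbb{E}[\alpha V_{11}]} = \frac{\mathrm{Ent}(V_{11})}{\mathbb{E}[V_{11}]} = \lim_{n \to \infty} C(W^{(V,n)}), $$

where the second equality uses [10, Remark 3.3.1]

$$ \mathrm{Ent}(\alpha V_{11}) = \mathbb{E}[\alpha V_{11} \log(\alpha V_{11})] - \mathbb{E}[\alpha V_{11}] \log(\mathbb{E}[\alpha V_{11}])
   = \alpha \, \mathbb{E}[V_{11} \log V_{11}] - \alpha \, \mathbb{E}[V_{11}] \log \mathbb{E}[V_{11}] = \alpha \, \mathrm{Ent}(V_{11}). $$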
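Properties (ii) to (iv) can be checked exactly for finite-valued entries. The following is a small sketch, assuming a hypothetical helper `ent_over_mean` that evaluates $\mathrm{Ent}(V)/\mathbb{E}[V]$ in bits for a random variable with finitely many values, using the convention $0 \log 0 = 0$; the numerical values are arbitrary illustrations.

```python
import math

def ent_over_mean(values, probs):
    """Ent(V)/E[V] in bits for a finite-valued nonnegative V (0 log 0 := 0)."""
    mu1 = sum(p * v for v, p in zip(values, probs))
    mu2 = sum(p * v * math.log2(v) for v, p in zip(values, probs) if v > 0)
    return mu2 / mu1 - math.log2(mu1)

# (ii) a deterministic entry gives zero asymptotic capacity
print(ent_over_mean([3.0], [1.0]))

# (iii) Bernoulli(eps) entries give log(1/eps), which grows as eps shrinks
for eps in (0.5, 0.1, 0.001):
    print(eps, ent_over_mean([0.0, 1.0], [1 - eps, eps]), math.log2(1 / eps))

# (iv) homogeneity: scaling all values by a positive constant changes nothing
vals, probs = [0.5, 1.0, 4.0], [0.2, 0.5, 0.3]
print(ent_over_mean(vals, probs), ent_over_mean([7 * v for v in vals], probs))
```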


Example 2.7 (Exponential distribution). Consider a DMC as defined above using an exponential
distribution with rate parameter $\lambda > 0$. Then for $n \to \infty$ its capacity converges to $\frac{1 - \kappa}{\ln 2}$ almost
surely and in $L^2$, where $\kappa$ denotes Euler’s constant. This follows directly from Theorem 2.3, since
for $V_{x,y} \sim \mathcal{E}(\lambda)$ we have $\mu_1 = \mathbb{E}[V_{x,y}] = \frac{1}{\lambda}$ and $\mu_2 = \mathbb{E}[V_{x,y} \log V_{x,y}] = \frac{1 - \kappa - \ln \lambda}{\lambda \ln 2}$. That the
asymptotic capacity is constant (i.e., independent of $\lambda$) is a direct consequence of the homogeneity
property in Remark 2.6, since $\alpha V_{x,y} \sim \mathcal{E}(\lambda / \alpha)$ for $\alpha > 0$.
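As a quick sanity check of this example, the sketch below estimates $\mathrm{Ent}(V)/\mathbb{E}[V]$ by Monte Carlo for exponentially distributed entries and compares it with the value $(1-\kappa)/\ln 2$ that Theorem 2.3 yields for $\mu_1 = 1/\lambda$. The helper `asymptotic_capacity_mc`, the sample size and the seed are assumptions made for illustration.

```python
import math
import random

def asymptotic_capacity_mc(sampler, num_samples=200_000, seed=0):
    """Monte Carlo estimate of Ent(V)/E[V] in bits from i.i.d. draws of V."""
    rng = random.Random(seed)
    s1 = s2 = 0.0
    for _ in range(num_samples):
        v = sampler(rng)
        s1 += v
        if v > 0:                       # convention 0 log 0 = 0
            s2 += v * math.log2(v)
    mu1, mu2 = s1 / num_samples, s2 / num_samples
    return mu2 / mu1 - math.log2(mu1)

EULER = 0.5772156649015329              # Euler's constant (kappa in the text)
closed_form = (1 - EULER) / math.log(2)

# The estimate should not depend on the rate parameter (homogeneity).
for lam in (0.5, 1.0, 4.0):
    est = asymptotic_capacity_mc(lambda rng: rng.expovariate(lam))
    print(f"lambda={lam}: estimate={est:.4f}, closed form={closed_form:.4f}")
```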

Example 2.8 (Symmetric Dirichlet distribution). Consider a DMC that is described by an $n \times n$
channel transition matrix, whose rows are independent random variables on the $n$-simplex.
More precisely, let the rows be i.i.d. random variables according to the symmetric Dirichlet
distribution $\mathrm{Dir}(\lambda, \ldots, \lambda)$ with concentration parameter $\lambda > 0$. It is known [11, Theorem 4.1,
p. 594] that for $n$ exponentially distributed i.i.d. random variables $V_{x,1}, \ldots, V_{x,n} \sim \mathcal{E}(\lambda)$, the
multivariate random variable $W_{x,\cdot} := V_{x,\cdot} / \sum_{y \in [n]} V_{x,y}$ admits a symmetric Dirichlet distribution
$\mathrm{Dir}(\lambda, \ldots, \lambda)$, that is the uniform distribution over the $n$-simplex for $\lambda = 1$. Hence, by Example 2.7
the capacity of a channel with i.i.d. symmetric Dirichlet distributed rows converges to $\frac{1 - \kappa}{\ln 2}$
almost surely and in $L^2$ as $n \to \infty$, where $\kappa$ denotes Euler’s constant.

Example 2.9 (Lognormal distribution). Consider a DMC (with $n = m$) as defined above using
a lognormal distribution $\ln\mathcal{N}(z, \sigma)$ with parameters $z \in \mathbb{R}$ and $\sigma > 0$. Then for $n \to \infty$ its
capacity converges to $\frac{\sigma^2}{2 \ln 2}$ almost surely and in $L^2$. This follows directly from Theorem 2.3,
since for $V_{x,y} \sim \ln\mathcal{N}(z, \sigma)$ we have $\mu_1 = \exp(z + \frac{\sigma^2}{2})$ and $\mu_2 = \mathbb{E}[V_{x,y} \log V_{x,y}] =
\frac{z + \sigma^2}{\ln 2} \exp(z + \frac{\sigma^2}{2})$. We note that $\alpha V_{x,y} \sim \ln\mathcal{N}(z + \ln \alpha, \sigma)$ for positive $\alpha$, which by the homogeneity
property (cf. Remark 2.6) implies that the asymptotic capacity does not depend on $z$.
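For the lognormal case the two moments can be worked out in a few lines. The derivation below is a sketch under the assumption that $V = \exp(z + \sigma Z)$ with $Z$ standard normal, and it uses the standard identity $\mathbb{E}[Z e^{\sigma Z}] = \sigma e^{\sigma^2/2}$; as elsewhere, $\log$ is taken to base 2 and $\ln$ is the natural logarithm.

```latex
% Assumption: V = \exp(z + \sigma Z), Z \sim \mathcal{N}(0,1).
\begin{align*}
\mu_1 &= \mathbb{E}[V] = e^{z + \sigma^2/2}, \\
\mu_2 &= \mathbb{E}[V \log V]
       = \frac{1}{\ln 2}\,\mathbb{E}\big[(z + \sigma Z)\, e^{z + \sigma Z}\big]
       = \frac{z + \sigma^2}{\ln 2}\, e^{z + \sigma^2/2}, \\
\frac{\mu_2}{\mu_1} - \log \mu_1
      &= \frac{z + \sigma^2}{\ln 2} - \frac{z + \sigma^2/2}{\ln 2}
       = \frac{\sigma^2}{2 \ln 2}.
\end{align*}
```

The location parameter $z$ cancels in the last line, in agreement with the homogeneity property.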

Four additional examples considering the uniform, gamma, chi-squared and beta distribution
can be found in Appendix B. Before we present a rigorous proof of Theorem 2.3 in the next section,
let us sketch an informal motivation that might provide some intuition about the proof.

Let us assume that the i.i.d. random variables $V_{x,y}$ take values in a finite set $[k]$, for some
$k \in \mathbb{N}$. Statistically, as the input and output alphabets get larger (i.e., $n, m \gg k$), the channel
matrix $W^{(V,n)}$ resembles a weakly symmetric channel (i.e., every row is a permutation of every
other row and all the column sums are equal). It is known [12, Theorem 7.2.1] that the capacity of
a weakly symmetric channel is given by $\log m - H(W_{x,\cdot})$ for $x \in [n]$ and that the uniform
input distribution is capacity achieving, i.e., the optimal input distribution does not depend on
the channel realization. We further note that the capacity of such channels only depends on the
statistics of the channel entries. In Section 2 A, to prove Theorem 2.3, we derive an analytical
upper and lower bound for the capacity and show that in the limit $n \to \infty$ they coincide at the
value predicted by Theorem 2.3. The upper bound is shown to be $\log m + \max_{x \in [n]} \sum_{y \in [m]} W_{x,y} \log W_{x,y}$ and
the lower bound $I(p, W^{(V,n)})$, where $p$ is the uniform distribution on $[n]$.
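The two bounds are cheap to evaluate for a sampled channel. The following sketch, with hypothetical helpers `random_channel` and `capacity_bounds`, draws exponentially distributed entries, normalizes the rows and prints the lower bound (mutual information at the uniform input) and the upper bound ($\log m$ plus the largest negative row entropy). The alphabet sizes and the seed are arbitrary; at moderate $n$ the upper bound may still be noticeably loose because it takes a maximum over rows.

```python
import math
import random

def random_channel(n, m, sampler, rng):
    """Draw i.i.d. nonnegative entries and normalize each row to sum to one."""
    rows = []
    for _ in range(n):
        v = [sampler(rng) for _ in range(m)]
        s = sum(v)
        rows.append([e / s for e in v])
    return rows

def xlog2x(t):
    """t * log2(t) with the convention 0 log 0 = 0."""
    return t * math.log2(t) if t > 0 else 0.0

def capacity_bounds(W):
    """Return (I(uniform, W), log2(m) + max_x sum_y W log2 W), both in bits."""
    n, m = len(W), len(W[0])
    neg_r = [sum(xlog2x(w) for w in row) for row in W]           # -r_x for every row
    upper = math.log2(m) + max(neg_r)
    q = [sum(W[x][y] for x in range(n)) / n for y in range(m)]   # output pmf, uniform input
    lower = sum(neg_r) / n - sum(xlog2x(qy) for qy in q)         # -mean(r) + H(q)
    return lower, upper

rng = random.Random(0)
for n in (50, 100, 200):
    W = random_channel(n, n, lambda r: r.expovariate(1.0), rng)
    lo, up = capacity_bounds(W)
    print(f"n={n}: {lo:.3f} <= C(W) <= {up:.3f}")
```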


A. Proof of Theorem 2.3 


To keep the notation simple we denote the channel transition matrix $W^{(V,n)}$ by $W$. We
reformulate the problem (1) by introducing an additional decision variable $q \in \Delta_m$ representing the
output distribution of the channel, together with the coupling constraint $W^\top p = q$. Whereas the
Lagrange dual problem to (1) can only be implicitly expressed through the solution of a system
of linear equations (as reported in [13, 14]), introducing the new decision variable $q$ allows us to
derive an explicit and simple Lagrange dual problem. It can be shown (see e.g. [4, Lemma 1]) that
the optimization problem (1) is equivalent to

$$ \text{(primal program):} \quad
   \begin{array}{ll}
   \max\limits_{p, q} & -r^\top p + H(q) \\
   \text{s.t.} & W^\top p = q \\
               & p \in \Delta_n, \ q \in \Delta_m,
   \end{array} \tag{2} $$

where $r_x := -\sum_{y=1}^m W_{x,y} \log W_{x,y}$. The Lagrangian dual program to (2) is

$$ \text{(dual program):} \quad \min_{\lambda} \{ G(\lambda) + F(\lambda) : \lambda \in \mathbb{R}^m \}, \tag{3} $$

where $G, F : \mathbb{R}^m \to \mathbb{R}$ are given by

$$ G(\lambda) = \left\{ \begin{array}{ll} \max\limits_{p} & -r^\top p + \lambda^\top W^\top p \\ \text{s.t.} & p \in \Delta_n \end{array} \right.
   \quad \text{and} \quad
   F(\lambda) = \left\{ \begin{array}{ll} \max\limits_{q} & H(q) - \lambda^\top q \\ \text{s.t.} & q \in \Delta_m. \end{array} \right. $$

Note that since the coupling constraint $W^\top p = q$ in the primal program (2) is affine, the set of
optimal solutions to the dual program (3) is nonempty [15, Proposition 5.3.1] and as such the
optimum is attained. As shown in [4, Section 2], $G$ and $F$ have analytical solutions given as

$$ G(\lambda) = \max_{i \in [n]} (W\lambda - r)_i \quad \text{and} \quad F(\lambda) = \log\Big( \sum_{i=1}^m 2^{-\lambda_i} \Big). \tag{4} $$

Lemma 2.10. Strong duality holds between (2) and (3).

Proof. The proof follows by a standard strong duality result of convex optimization, see
[15, Proposition 5.3.1]. □

Weak duality of convex programming implies that the dual always is an upper bound to the
primal problem, i.e., for every $p \in \Delta_n$ and for every $\lambda \in \mathbb{R}^m$,
$C_{LB}^{(p)}(W) := I(p, W) \leq G(\lambda) + F(\lambda) =: C_{UB}^{(\lambda)}(W)$.
By following the proof of Lemma A.1, one can show that the mapping $\mathcal{M}_{n,m} \ni W \mapsto
C_{UB}^{(\lambda)}(W) \in \mathbb{R}_{\geq 0}$ is measurable for any $\lambda \in \mathbb{R}^m$ and as such $C_{UB}^{(\lambda)}(W)$ is a random variable. To prove
Theorem 2.3, we consider the upper bound $C_{UB}^{(\lambda=0)}(W) := G(0) + F(0)$, which is the Lagrange dual
function evaluated at $\lambda = 0$. As a lower bound we consider the mutual information evaluated for a
uniform input distribution, i.e., $C_{LB}^{(p)}(W) := I(p, W)$, where $p_i = \frac{1}{n}$ for all $i \in [n]$. Note that by
the measurability of the mutual information, $C_{LB}^{(p)}(W)$ is a random variable. We will show that
$C_{UB}^{(\lambda=0)}(W)$ and $C_{LB}^{(p)}(W)$ converge to the asymptotic capacity predicted by Theorem 2.3 in the
limit $n \to \infty$, which then proves the assertion.
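Weak duality is easy to probe numerically. The sketch below evaluates the primal objective $-r^\top p + H(W^\top p)$, which equals $I(p, W)$, and the dual value $G(\lambda) + F(\lambda)$ with $G(\lambda) = \max_i (W\lambda - r)_i$ and $F(\lambda) = \log \sum_y 2^{-\lambda_y}$, for random instances. The helper names, the instance generator and the ranges of the random draws are assumptions for illustration only.

```python
import math
import random

def entropy_bits(q):
    """Shannon entropy in bits, with 0 log 0 = 0."""
    return -sum(t * math.log2(t) for t in q if t > 0)

def primal_value(p, W):
    """-r^T p + H(W^T p), which equals the mutual information I(p, W)."""
    n, m = len(W), len(W[0])
    r = [entropy_bits(row) for row in W]
    q = [sum(p[x] * W[x][y] for x in range(n)) for y in range(m)]
    return -sum(r[x] * p[x] for x in range(n)) + entropy_bits(q)

def dual_value(lam, W):
    """G(lam) + F(lam) with G = max_x (W lam - r)_x and F = log2 sum_y 2^(-lam_y)."""
    r = [entropy_bits(row) for row in W]
    G = max(sum(w * l for w, l in zip(row, lam)) - rx for row, rx in zip(W, r))
    F = math.log2(sum(2.0 ** (-l) for l in lam))
    return G + F

def random_instance(rng, n, m):
    """Hypothetical test instance: random channel, input pmf and multiplier."""
    W = []
    for _ in range(n):
        v = [rng.expovariate(1.0) for _ in range(m)]
        s = sum(v)
        W.append([e / s for e in v])
    raw = [rng.random() + 1e-3 for _ in range(n)]
    total = sum(raw)
    p = [e / total for e in raw]
    lam = [rng.uniform(-3.0, 3.0) for _ in range(m)]
    return p, W, lam

rng = random.Random(0)
gaps = []
for _ in range(200):
    p, W, lam = random_instance(rng, 5, 7)
    gaps.append(dual_value(lam, W) - primal_value(p, W))
print("smallest duality gap over all trials:", min(gaps))
```

For a binary symmetric channel the choice $\lambda = 0$ with the uniform input already closes the gap, which is the weakly symmetric situation that motivates the bounds used in the proof.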

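Lemma 2.11. Under Assumptions 2.1 and 2.2, for $n \to \infty$, the random variable $C_{UB}^{(\lambda=0)}(W)$
converges to $\frac{\mu_2}{\mu_1} - \log \mu_1$ almost surely and in $L^2$.

Proof. According to (4) we have

$$ \begin{aligned}
C_{UB}^{(\lambda=0)}(W) &= G(0) + F(0) \\
 &= \max_{x \in [n]} \{ -r_x \} + \log m \\
 &= \max_{x \in [n]} \sum_{y \in [m]} W_{x,y} \log W_{x,y} + \log m.
\end{aligned} \tag{5} $$

According to Lemma C.1, for every $x \in [n]$ as $n \to \infty$, $\sum_{y \in [m]} W_{x,y} \log W_{x,y} + \log m$ converges to
$\frac{\mu_2}{\mu_1} - \log \mu_1$ almost surely and in $L^2$. This finally proves the assertion. □

Lemma 2.12. Under Assumptions 2.1 and 2.2, for $n \to \infty$, the random variable $C_{LB}^{(p)}(W)$
converges to $\frac{\mu_2}{\mu_1} - \log \mu_1$ almost surely and in $L^2$.

Proof. The mutual information for a uniform input distribution, i.e., $p_i = \frac{1}{n}$ for all $i \in [n]$, can be
written as

$$ \begin{aligned}
C_{LB}^{(p)}(W) &= \frac{1}{n} \sum_{x \in [n], y \in [m]} W_{x,y} \Big( \log n + \log W_{x,y} - \log \sum_{k \in [n]} W_{k,y} \Big) \\
 &= \frac{1}{n} \sum_{x \in [n], y \in [m]} W_{x,y} \big( \log n + \log W_{x,y} \big)
    - \frac{1}{n} \sum_{x \in [n], y \in [m]} W_{x,y} \log \sum_{k \in [n]} W_{k,y}.
\end{aligned} \tag{6} $$

According to Lemma C.2, for $n \to \infty$, $\frac{1}{n} \sum_{x \in [n], y \in [m]} W_{x,y} \log \sum_{k \in [n]} W_{k,y}$ converges to $-\log \gamma$
almost surely and in $L^2$. We can simplify the first part of (6) by making use of the fact that $W_{x,y}$
is normalized, i.e., that $\sum_{y \in [m]} W_{x,y} = 1$ for all $x \in [n]$:

$$ \begin{aligned}
\frac{1}{n} \sum_{x \in [n], y \in [m]} W_{x,y} \big( \log n + \log W_{x,y} \big)
 &= \frac{1}{n} \Big( \log n \sum_{x \in [n]} \sum_{y \in [m]} W_{x,y} + \sum_{x \in [n], y \in [m]} W_{x,y} \log W_{x,y} \Big) \\
 &= \log n + \frac{1}{n} \sum_{x \in [n], y \in [m]} W_{x,y} \log W_{x,y}.
\end{aligned} \tag{7} $$

Consider the upper bound

$$ \begin{aligned}
\log n + \frac{1}{n} \sum_{x \in [n], y \in [m]} W_{x,y} \log W_{x,y}
 &\leq \log n + \max_{x \in [n]} \sum_{y \in [m]} W_{x,y} \log W_{x,y} \\
 &= \log n + \sum_{y \in [m]} W_{i,y} \log W_{i,y} \\
 &= \log m + \sum_{y \in [m]} W_{i,y} \log W_{i,y} - \log\Big( \gamma + \frac{\delta_n}{n} \Big),
\end{aligned} \tag{8} $$

for some $i \in [n]$, where $\delta_n := \lceil \gamma n \rceil - \gamma n \in [0,1)$ for all $n$. According to Lemma C.1, the right hand
side of (8) converges to $\frac{\mu_2}{\mu_1} - \log \mu_1 - \log \gamma$ almost surely and in $L^2$ for $n \to \infty$. We can also bound
the same term from below as

$$ \begin{aligned}
\log n + \frac{1}{n} \sum_{x \in [n], y \in [m]} W_{x,y} \log W_{x,y}
 &\geq \log n + \min_{x \in [n]} \sum_{y \in [m]} W_{x,y} \log W_{x,y} \\
 &= \log n + \sum_{y \in [m]} W_{j,y} \log W_{j,y} \\
 &= \log m + \sum_{y \in [m]} W_{j,y} \log W_{j,y} - \log\Big( \gamma + \frac{\delta_n}{n} \Big),
\end{aligned} \tag{9} $$

for some $j \in [n]$, where $\delta_n := \lceil \gamma n \rceil - \gamma n \in [0,1)$ for all $n$. According to Lemma C.1, the right hand
side of (9) converges to $\frac{\mu_2}{\mu_1} - \log \mu_1 - \log \gamma$ almost surely and in $L^2$ as $n \to \infty$. Thus for $n \to \infty$,
(6) converges to $\frac{\mu_2}{\mu_1} - \log \mu_1$ almost surely and in $L^2$, which proves the assertion. □

Lemmas 2.11 and 2.12 complete the proof of Theorem 2.3 as $C_{LB}^{(p)}(W) \leq C(W) \leq C_{UB}^{(\lambda=0)}(W)$.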


3. SIMULATION RESULTS 

In this section we compute the capacity of the DMCs introduced in Section 2 for finite alphabet
sizes. For the computation we use a recently introduced method [4] which allows us to efficiently
compute close upper and lower bounds to the capacity. Roughly speaking, the method [4] is an
iterative accelerated first-order method that exploits duality of convex programming together with
the fact that entropy maximization problems admit closed-form solutions.

Example 3.1 (Exponential distribution). We consider a channel that is given by the stochastic
matrix $W = (W_{x,y})_{x,y \in [n]}$ with $W_{x,y} = V_{x,y} / \sum_{y' \in [n]} V_{x,y'}$, where $V_{x,y}$ are i.i.d. $\mathcal{E}(\lambda)$ random variables
for all $x, y \in [n]$. As explained in Example 2.8, with this channel construction the rows
$W_{x,\cdot}$ admit a symmetric Dirichlet distribution with concentration parameter $\lambda$ for all $x \in [n]$.
Figure 1 depicts the capacity of $W$ for variable alphabet sizes. We perform five independent
experiments for each value of $n$. One can observe that as $n \to \infty$ the capacity approaches the
asymptotic limit as determined in Example 2.7. In addition, one can see that the variance between
the capacities of the independently chosen channels is decreasing for increasing alphabet sizes.


[Figure 1: upper and lower bounds on the capacity for each experiment, together with the asymptotic
capacity, plotted against the alphabet size n (100 to 1,000); the vertical axis ranges from 0.6 to 0.72.]

FIG. 1. For different alphabet sizes n we plot the capacity of five random channels, constructed as explained
in Example 3.1. The method introduced in [4] is used to determine upper and lower bounds for the capacity
for finite alphabet sizes n. The asymptotic capacity (for n → ∞) is depicted by the dashed line.

4. APPLICATION IN BAYESIAN EXPERIMENT DESIGN 

The main objective of optimal experiment design is, based on prior knowledge, to select a most
informative experiment, where we restrict attention to a certain notion of information that traces
back to Shannon [1]; see [16] for a comprehensive survey. This section will motivate the study of a
convergence rate for the asymptotic capacity (under more restrictive assumptions on the channel
model) that is the content of Section 5.

Let the random variable $X \in \mathcal{X} := [n]$ describe a parameter to be determined with a prior
probability distribution $p \in \Delta_n$ and let the random variable $Y \in \mathcal{Y} := [m]$ denote an observation.
Furthermore, consider a family of experiments² $(W^{(\lambda,n)})_{\lambda \in \Lambda}$, where $\Lambda$ characterizes the
set of all admissible experiments, and each experiment $W^{(\lambda,n)} \in \mathcal{M}_{n,m}$ is characterized by the
conditional probabilities $W^{(\lambda,n)}_{x,y} := \mathbb{P}^{(\lambda)}[Y = y \mid X = x]$ for all $x \in \mathcal{X}$ and $y \in \mathcal{Y}$. The task of
optimal experiment design is, given a prior distribution $p \in \Delta_n$, to find the experiment that
provides the highest average amount of information, as described by the mutual information between
the parameter and the observation [16, Definition 2], i.e., the goal is to find $\lambda^* \in \Lambda$ such that
$I(p, W^{(\lambda^*,n)}) \geq I(p, W^{(\lambda,n)})$ for all $\lambda \in \Lambda$. This requires one to compute

$$ \sup_{\lambda \in \Lambda} I\big(p, W^{(\lambda,n)}\big). \tag{10} $$

The optimization problem (10) in general is difficult to solve. Moreover, an evaluation of the
objective function, the mutual information, for a given $\lambda$ has a computational complexity of $O(nm)$
and as such for large sets $\mathcal{X}$ and $\mathcal{Y}$ even solving (10) for local optimality can be computationally
demanding.

² Strictly speaking, an experiment consists of a tuple, as pointed out in [16]. Since the
conditional probability $W^{(\lambda,n)}$ is our optimization variable and since $\mathcal{X}$ and $\mathcal{Y}$ remain constant, with a slight abuse
of notation, we call $W^{(\lambda,n)}$ an experiment.

The task of designing optimal experiments has recently attracted interest in the context of
biological systems, where understanding about the underlying biological mechanisms emerges through
iterations of modelling and experiments. Since experiments are expensive, an effective selection of
informative experiments is essential, see [17]. We will show in Section 4 A that the asymptotic
capacity formula, given in Theorem 2.3, allows us to derive upper bounds on the expected
information gain by an experiment for certain classes of (random) experiments. In addition,
Theorem 2.3 provides an efficient method to select suboptimal experiments that are almost optimal in
our numerical example, see Section 4 B. Let $(V^{(\lambda)}_{x,y})_{x \in [n], y \in [m]}$ be i.i.d. random variables for each
$\lambda \in \Lambda$ and consider a channel transition matrix $W^{(\lambda,V,n)}$ obtained by normalizing their rows, as in Section 2.


A. Upper bound on maximum expected information gain


In the limit, as $n \to \infty$, we can establish the following upper bound on the maximum expected
information gain by an experiment.

Proposition 4.1 (Upper bound on maximum expected information gain). For the family of
channels $(W^{(\lambda,V,n)})_{\lambda \in \Lambda}$ introduced above that satisfy Assumptions 2.2 and 5.1, we have with high
probability

$$ \lim_{n \to \infty} \sup_{\lambda \in \Lambda} I\big(p, W^{(\lambda,V,n)}\big) \leq \sup_{\lambda \in \Lambda} \frac{\mathrm{Ent}\big(V^{(\lambda)}_{x,y}\big)}{\mathbb{E}\big[V^{(\lambda)}_{x,y}\big]}. \tag{11} $$

The upper bound provided by Proposition 4.1 is particularly useful if the right-hand side of
(11) admits a closed form solution, whereas the optimal information gain $\sup_{\lambda \in \Lambda} I(p, W^{(\lambda,V,n)})$ is
difficult to compute (see Section 4 B for more details).


Before proving Proposition 4.1 we recall a preliminary standard result.

Lemma 4.2 (Theorem 7.11 in [18]). Suppose $X$ is a metric space, $E$ is a subset of $X$ and $x$ is a
limit point of $E$. Suppose $f_n : X \to \mathbb{R}$ for each $n \in \mathbb{N}$ and $f : X \to \mathbb{R}$ are functions and $A_n$ are
numbers. If $\lim_{n \to \infty} f_n(y) = f(y)$ uniformly in $X$ and $\lim_{y \to x} f_n(y) = A_n$ pointwise over $n \in \mathbb{N}$,
then

$$ \lim_{y \to x} \lim_{n \to \infty} f_n(y) = \lim_{n \to \infty} \lim_{y \to x} f_n(y). $$

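Proof of Proposition 4.1. We show that with high probability

$$ \begin{aligned}
\lim_{n \to \infty} \sup_{\lambda \in \Lambda} I\big(p, W^{(\lambda,V,n)}\big)
 &\leq \lim_{n \to \infty} \sup_{\lambda \in \Lambda} C\big(W^{(\lambda,V,n)}\big) \\
 &= \sup_{\lambda \in \Lambda} \lim_{n \to \infty} C\big(W^{(\lambda,V,n)}\big) \\
 &= \sup_{\lambda \in \Lambda} \frac{\mathrm{Ent}\big(V^{(\lambda)}_{x,y}\big)}{\mathbb{E}\big[V^{(\lambda)}_{x,y}\big]}.
\end{aligned} \tag{12} $$

The first inequality of (12) is trivial and the last equality follows by Theorem 2.3. Therefore it
remains to prove that the first equality in (12) holds almost surely. Note first that the following
property holds.

(i) The capacity of the channel $W^{(\lambda,V,n)}$ converges uniformly in $\lambda$ to its asymptotic capacity in
probability, i.e.,

$$ \text{for all } \varepsilon > 0, \quad \lim_{n \to \infty} \mathbb{P}\left[ \sup_{\lambda \in \Lambda} \left| C\big(W^{(\lambda,V,n)}\big) - \frac{\mathrm{Ent}\big(V^{(\lambda)}_{x,y}\big)}{\mathbb{E}\big[V^{(\lambda)}_{x,y}\big]} \right| > \varepsilon \right] = 0, $$

because by Theorem 5.2 (whose derivation is provided in the next section), we know that for each
$n$ there exist $M_n < \infty$ and $N_n < 1$ as well as $\Omega_n \subset \Omega$ with $\mathbb{P}[\Omega_n] \geq N_n$ such that

$$ \sup_{\lambda \in \Lambda} \left| C\big(W^{(\lambda,V,n)}\big) - \frac{\mathrm{Ent}\big(V^{(\lambda)}_{x,y}\big)}{\mathbb{E}\big[V^{(\lambda)}_{x,y}\big]} \right| \leq M_n \quad \text{on } \Omega_n. \tag{13} $$

Moreover, $M_n \to 0$ and $N_n \to 1$ as $n \to \infty$, which implies that

$$ \text{for all } \varepsilon > 0, \quad \lim_{n \to \infty} \mathbb{P}\left[ \sup_{\lambda \in \Lambda} \left| C\big(W^{(\lambda,V,n)}\big) - \frac{\mathrm{Ent}\big(V^{(\lambda)}_{x,y}\big)}{\mathbb{E}\big[V^{(\lambda)}_{x,y}\big]} \right| > \varepsilon \right] = 0 $$

and hence, property (i) holds. Note also that the following property holds trivially since $C(W) \leq
\log(n \wedge m)$ for any channel matrix $W \in \mathcal{M}_{n,m}$.

(ii) $\sup_{\lambda \in \Lambda} C\big(W^{(\lambda,V,n)}\big) < \infty$ almost surely for all $n \in \mathbb{N}$.

Hence Lemma 4.2, using the two properties (i), (ii), implies that with high probability

$$ \lim_{n \to \infty} \sup_{\lambda \in \Lambda} C\big(W^{(\lambda,V,n)}\big) = \sup_{\lambda \in \Lambda} \lim_{n \to \infty} C\big(W^{(\lambda,V,n)}\big), $$

which readily can be shown to imply the desired equality and therefore completes the proof. □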


B. Example: Constrained lognormal distribution 


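We consider the setting given in Example 2.9 and introduce a parameter $\lambda := (z, \sigma^2) \in \mathbb{R} \times
\mathbb{R}_{>0}$. For given constants $\ell_i, u_i$ for $i = 1, 2$ with $\ell_i > 0$, we consider the family of experiments
$(W^{(\lambda,V,n)})_{\lambda \in \Lambda}$, where

$$ \Lambda = \Big\{ (z, \sigma^2) \in \mathbb{R} \times \mathbb{R}_{>0} : \exp\big(z + \tfrac{\sigma^2}{2}\big) \in [\ell_1, u_1], \ (\exp(\sigma^2) - 1) \exp(2z + \sigma^2) \in [\ell_2, u_2] \Big\}, \tag{14} $$

and $\mathbb{E}\big[V^{(\lambda)}_{x,y}\big] = \exp\big(z + \tfrac{\sigma^2}{2}\big)$ and $\mathrm{Var}\big[V^{(\lambda)}_{x,y}\big] = (\exp(\sigma^2) - 1) \exp(2z + \sigma^2)$. For this family of
experiments Assumptions 2.2, 5.1 clearly hold and an upper bound to the maximum expected
information gain provided by an experiment, using Proposition 4.1, can be stated in closed form.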

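Proposition 4.3. In the limit $n \to \infty$ an upper bound on the maximum information gain by an
experiment from the family (14) is given with high probability by

$$ \lim_{n \to \infty} \sup_{\lambda \in \Lambda} I\big(p, W^{(\lambda,V,n)}\big) \leq (2 \ln 2)^{-1} \ln\Big( \frac{u_2}{\ell_1^2} + 1 \Big). $$

Proof. According to Proposition 4.1 and Example 2.9

$$ \lim_{n \to \infty} \sup_{\lambda \in \Lambda} I\big(p, W^{(\lambda,V,n)}\big)
   \leq \lim_{n \to \infty} \sup_{\lambda \in \Lambda} C\big(W^{(\lambda,V,n)}\big)
   = \sup_{\lambda \in \Lambda} \lim_{n \to \infty} C\big(W^{(\lambda,V,n)}\big)
   = \left\{ \begin{array}{ll}
     \max\limits_{z, \sigma^2} & \frac{\sigma^2}{2 \ln 2} \\
     \text{s.t.} & \ell_1 \leq \exp\big(z + \frac{\sigma^2}{2}\big) \leq u_1 \\
                 & \ell_2 \leq (\exp(\sigma^2) - 1) \exp(2z + \sigma^2) \leq u_2 \\
                 & \sigma^2 \in \mathbb{R}_{>0}, \ z \in \mathbb{R}.
     \end{array} \right. \tag{15} $$

By introducing the variable $a := \exp(2z + \sigma^2)$, the optimization problem (15) can be rewritten as

$$ \left\{ \begin{array}{ll}
   \max\limits_{\sigma^2, a} & \frac{\sigma^2}{2 \ln 2} \\
   \text{s.t.} & \ell_1 \leq \sqrt{a} \leq u_1 \\
               & \ell_2 \leq (\exp(\sigma^2) - 1) \, a \leq u_2 \\
               & \sigma^2 \in \mathbb{R}_{>0}, \ a \in \mathbb{R}_{>0}.
   \end{array} \right. $$

The monotonicity of $(\exp(\sigma^2) - 1) \, a$ with respect to $\sigma^2$ and $a$ implies that the optimizers are uniquely
given by $a = \ell_1^2$ and $\sigma^2 = \ln\big( \frac{u_2}{\ell_1^2} + 1 \big)$, which completes the proof. □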
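The closed-form bound can be cross-checked against a coarse grid search over the reformulated problem in the variables $a$ and $\sigma^2$. This is only a sketch: the function names, the grid resolution, the search window for $\sigma^2$ and the constants $\ell_1, u_1, \ell_2, u_2$ below are hypothetical choices made for illustration.

```python
import math

LN2 = math.log(2.0)

def closed_form_bound(l1, u2):
    """(2 ln 2)^{-1} ln(u2 / l1^2 + 1), in bits."""
    return math.log(u2 / l1 ** 2 + 1.0) / (2.0 * LN2)

def grid_search_bound(l1, u1, l2, u2, grid_a=200, grid_s=300):
    """Maximize sigma^2 / (2 ln 2) over a grid of feasible pairs (a, sigma^2)."""
    s_max = 1.5 * math.log(u2 / l1 ** 2 + 1.0)    # search window, chosen generously
    best = -math.inf
    for i in range(grid_a + 1):
        a = l1 ** 2 + (u1 ** 2 - l1 ** 2) * i / grid_a   # l1 <= sqrt(a) <= u1
        for j in range(1, grid_s + 1):
            s2 = s_max * j / grid_s
            variance = (math.exp(s2) - 1.0) * a
            if l2 <= variance <= u2:                     # variance constraint
                best = max(best, s2 / (2.0 * LN2))
    return best

l1, u1, l2, u2 = 1.0, 2.0, 0.5, 3.0    # hypothetical constants
print(closed_form_bound(l1, u2), grid_search_bound(l1, u1, l2, u2))
```

The grid value can only fall short of the closed form by roughly one grid step, since the maximizer sits at the corner $a = \ell_1^2$ where the variance constraint is active at $u_2$.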

Let us consider the case where the prior distribution is uniform. In this case the upper bound 
(11) is tight, by following the proofs of Theorem 2.3 and Proposition 4.3. Figure 2 depicts for 
different alphabet sizes n in (a) the empirical mean of the maximum expected information gain 
(blue line) for 1000 experiments, which in general is difficult to compute in particular for higher 
dimensional examples than the one in Section 4 B. The red line represents the empirical mean of the suboptimal
expected information gain, that is given by evaluating the mutual information for the optimal
parameters for the asymptotic capacity, derived in Proposition 4.3 and as such is computationally 
much cheaper. The empirical variance of the maximum expected information gain (blue line) as 
well as the empirical variance of the suboptimal expected information gain (red line) are depicted 
in (b). 


[Figure 2: two panels plotted against the alphabet size n: (a) empirical mean, (b) empirical variance.]

FIG. 2. For different alphabet sizes n, we plot in (a) the empirical mean of the maximum expected
information gain (blue line) $\frac{1}{N} \sum_{i=1}^N \sup_{\lambda \in \Lambda} I(p, W_i^{(\lambda,V,n)})$, where $(W_i^{(\lambda,V,n)})_{i=1}^N$ are independent channels and
N = 1000. The red line represents the empirical mean of the suboptimal expected information gain, that is
given by $\frac{1}{N} \sum_{i=1}^N I(p, W_i^{(\hat{\lambda},V,n)})$, where $\hat{\lambda}$ are the optimal parameters for the asymptotic capacity, derived in
Proposition 4.3. (b) depicts the empirical variance of the maximum expected information gain (blue line)
as well as the empirical variance of the suboptimal expected information gain (red line).


5. CONVERGENCE RATE 

This section addresses how fast the capacity of a channel with the form introduced in Section 2
converges to the asymptotic value predicted by Theorem 2.3. In addition, we consider a different
model for the channel construction compared to Section 2. Let $(V_{x,y})_{x \in [n], y \in [m]}$ be a sequence of
independent nonnegative random variables such that the following assumption holds, where we use
the notation $(x)_+^q = (0 \vee x)^q$.

Assumption 5.1. There exist positive numbers $K$ and $T$ such that for all $n, m \in \mathbb{N}$

(i) $\max\Big\{ \frac{1}{n} \sum_{x \in [n]} \mathbb{E}[V_{x,y}^2], \ \frac{1}{n} \sum_{x \in [n]} \mathbb{E}[(V_{x,y} \log V_{x,y})^2] \Big\} \leq K$ for all $y \in [m]$, and
$\max\Big\{ \frac{1}{m} \sum_{y \in [m]} \mathbb{E}[V_{x,y}^2], \ \frac{1}{m} \sum_{y \in [m]} \mathbb{E}[(V_{x,y} \log V_{x,y})^2] \Big\} \leq K$ for all $x \in [n]$;

(ii) $\max\Big\{ \frac{1}{n} \sum_{x \in [n]} \mathbb{E}[(V_{x,y})_+^q], \ \frac{1}{n} \sum_{x \in [n]} \mathbb{E}[(V_{x,y} \log V_{x,y})_+^q] \Big\} \leq \frac{q!}{2} K T^{q-2}$ for all $q \geq 3$, $y \in [m]$, and
$\max\Big\{ \frac{1}{m} \sum_{y \in [m]} \mathbb{E}[(V_{x,y})_+^q], \ \frac{1}{m} \sum_{y \in [m]} \mathbb{E}[(V_{x,y} \log V_{x,y})_+^q] \Big\} \leq \frac{q!}{2} K T^{q-2}$ for all $q \geq 3$, $x \in [n]$;

(iii) $\frac{1}{m} \sum_{y \in [m]} \mathbb{E}[V_{x,y}] = \frac{1}{n} \sum_{x \in [n]} \mathbb{E}[V_{x,y}]$ for all $x \in [n]$ on the left hand side and all $y \in [m]$ on
the right hand side;

(iv) $V_{x,y} > 0$ almost surely for all $x \in [n]$ and $y \in [m]$.

We denote by $\mu_{1,n} := \frac{1}{m} \sum_{y \in [m]} \mathbb{E}[V_{x,y}] = \frac{1}{n} \sum_{x \in [n]} \mathbb{E}[V_{x,y}]$, $\mu_{2,n} := \frac{1}{m} \sum_{y \in [m]} \mathbb{E}[V_{x,y} \log V_{x,y}]$
and define the channel transition matrix $W = (W_{x,y})_{x \in [n], y \in [m]}$ by $W_{x,y} = V_{x,y} / \sum_{y' \in [m]} V_{x,y'}$. Let
$f : \mathbb{R}_{\geq 0} \times \mathbb{R}_{>0} \to \mathbb{R}_{\geq 0}$ denote the function

$$ f(t, n) := \exp\Big( - \frac{n t^2}{2 (K + T t)} \Big). \tag{16} $$

The main difference between the random channel model considered in Section 2 and the one
in this section is that here we assume that the random variables $(V_{x,y})_{x \in [n], y \in [m]}$ are independent
and such that Assumption 5.1 holds, whereas in Section 2 we assume that the random variables
$(V_{x,y})_{x \in [n], y \in [m]}$ are independent and identically distributed and satisfy Assumption 2.1. Clearly
Assumption 5.1 is stronger than Assumption 2.1, which allows us to state a rate of convergence.
Note that Assumptions 5.1(i) and (ii) are necessitated by the use of Bernstein’s inequality (see
Lemma D.1), Assumption 5.1(iii) relates to the link with weakly symmetric channels and
Assumption 5.1(iv) significantly strengthens the positive mean assumption (Assumption 2.1).

Theorem 5.2 (Rate of convergence). Under Assumption 5.1, the capacity of the DMC defined
above satisfies for any $t \in \mathbb{R}_{>0}$


P 


C(W(^’’")) - ( ^ - log^i. 

\hl,n 


> t 


< (2/(at/2, m) -h f{^,m)) V 


{‘^f{at/4.,m) + f{^,m) + f{l3t/{2L),n) + f{fit/{2L),m)) , 


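with

$$ \alpha_t = \begin{cases} \dfrac{t\,\mu_{1,n}^2}{\mu_{1,n}(1+t) + \mu_{2,n}}, & \text{if } \mu_{1,n} + \mu_{2,n} \geq 0 \\[2ex] \dfrac{t\,\mu_{1,n}^2}{\mu_{1,n}(1-t) + \mu_{2,n}}, & \text{otherwise} \end{cases}, \qquad \beta_t = \frac{t\,\mu_{1,n}}{2+t}, \qquad L = \frac{1}{a \ln 2} $$

and

$$ a = \min\left\{ \frac{1}{m}\sum_{y=1}^{m} V_{x,y},\ \sum_{k\in[n]} \frac{V_{k,y}}{\sum_{\ell\in[m]} V_{k,\ell}},\ \frac{n}{m} \right\}. $$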
Remark 5.3 (Exponential convergence). Note that $\mu_{1,n}$ is strictly larger than zero for any $n \in \mathbb{N}$
by Assumption 5.1(iv). Moreover, Assumption 5.1(i) implies that there exists a constant $S$ such
that $\mu_{2,n} \leq S$ for any $n \in \mathbb{N}$. Therefore, the parameters $\alpha_t$ and $\beta_t$ in Theorem 5.2 can be bounded
from below independently of $n$. Assumption 5.1(iv) further ensures that the parameter $a$ can be
bounded from below independently of $n$ and as such the parameter $L$ is bounded from above and
below independently of $n$. Hence, Theorem 5.2 implies exponential convergence in $n$.
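
To make the exponential decay concrete, the following minimal Python sketch evaluates the tail function $f(t,n) = \exp(-n t^2 / (2(K + \Gamma t)))$ from (16) for a fixed deviation $t$ and growing $n$. The values $K = 1$ and $\Gamma = 1$ are illustrative assumptions only (the paper fixes no numerical constants), and the function names are hypothetical.

```python
import math


def bernstein_tail(t: float, n: int, K: float = 1.0, Gamma: float = 1.0) -> float:
    """Evaluate f(t, n) = exp(-n t^2 / (2 (K + Gamma t))) as in (16).

    K and Gamma default to 1.0 purely for illustration (assumed values).
    """
    if t < 0 or K <= 0 or Gamma <= 0:
        raise ValueError("need t >= 0 and K, Gamma > 0")
    return math.exp(-n * t * t / (2.0 * (K + Gamma * t)))


def decay_table(t: float, sizes):
    """Return (n, f(t, n)) pairs; for fixed t the values shrink geometrically in n."""
    return [(n, bernstein_tail(t, n)) for n in sizes]


if __name__ == "__main__":
    for n, value in decay_table(0.5, [10, 100, 1000]):
        print(f"n={n:5d}  f(0.5, n)={value:.3e}")
```

For fixed $t$, doubling $n$ squares the bound, which is the sense in which the concentration is exponential in $n$.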

Assume that as $n \to \infty$, $\mu_{1,n}$ and $\mu_{2,n}$ converge and denote the limits by $\mu_1 := \lim_{n\to\infty} \mu_{1,n}$
and $\mu_2 := \lim_{n\to\infty} \mu_{2,n}$.

Corollary 5.4 (Asymptotic capacity). Under Assumptions 2.2 and 5.1, for $n \to \infty$, the capacity
of the DMC defined above converges to $\frac{\mu_2}{\mu_1} - \log\mu_1$ in probability.

Proof. Follows directly from Theorem 5.2. □ 

Since the exponential concentration provided in Theorem 5.2 is summable, a direct application 
of the Borel-Cantelli Lemma [19, Theorem 2.3.1] allows us to improve Corollary 5.4 to almost sure 
convergence. 


A. Proof of Theorem 5.2 


We prove convergence rates separately for the upper and lower bounds of Section 2
(Propositions 5.5 and 5.6, respectively). The claim follows because the capacity lies between
the upper and lower bounds and hence converges at the slower of the two rates.
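
The sandwich argument can be illustrated numerically. The sketch below draws one random channel with i.i.d. Exp(1) entries (an assumed example distribution), normalizes the rows, and evaluates the two bounds in the form used in Appendix D: the upper bound $\log m + \max_x \sum_y W_{x,y} \log W_{x,y}$ and the lower bound given by the mutual information at the uniform input distribution. For Exp(1) entries the limit $\mu_2/\mu_1 - \log\mu_1$ equals $(1 - \gamma_E)/\ln 2 \approx 0.61$ bits, with $\gamma_E$ the Euler-Mascheroni constant. Function names, seed and alphabet sizes are hypothetical choices.

```python
import math
import random

EULER_GAMMA = 0.5772156649015329  # Euler-Mascheroni constant


def xlog2x(v: float) -> float:
    """v * log2(v) with the convention 0 * log 0 = 0."""
    return v * math.log2(v) if v > 0.0 else 0.0


def random_channel(n: int, m: int, rng: random.Random):
    """Row-normalize an n x m matrix of i.i.d. Exp(1) entries (assumed distribution)."""
    rows = []
    for _ in range(n):
        v = [rng.expovariate(1.0) for _ in range(m)]
        s = sum(v)
        rows.append([vi / s for vi in v])
    return rows


def capacity_bounds(w):
    """Return (lower, upper) bounds on the capacity in bits.

    upper: log2(m) + max_x sum_y W[x][y] log2 W[x][y]
    lower: mutual information under the uniform input distribution
    """
    n, m = len(w), len(w[0])
    neg_entropies = [sum(xlog2x(p) for p in row) for row in w]
    upper = math.log2(m) + max(neg_entropies)
    col_sums = [sum(w[x][y] for x in range(n)) for y in range(m)]
    cross = sum(w[x][y] * math.log2(col_sums[y])
                for x in range(n) for y in range(m) if w[x][y] > 0.0)
    lower = math.log2(n) + sum(neg_entropies) / n - cross / n
    return lower, upper


def exp_limit() -> float:
    """mu_2 / mu_1 - log2(mu_1) for Exp(1) entries: (1 - gamma_E) / ln 2."""
    return (1.0 - EULER_GAMMA) / math.log(2.0)


if __name__ == "__main__":
    lo, hi = capacity_bounds(random_channel(150, 150, random.Random(0)))
    print(f"lower={lo:.3f}  upper={hi:.3f}  limit={exp_limit():.3f}")
```

The lower bound never exceeds the upper bound for any channel matrix; how tightly the two enclose the limit depends on $n$ and $m$.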

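Proposition 5.5. A random channel $W$ as introduced in this section with $C_{\mathrm{UB}}^{(n,m)}(W)$ given in (5)
satisfies

$$ \mathbb{P}\left[ \left| C_{\mathrm{UB}}^{(n,m)}(W) - \left( \frac{\mu_{2,n}}{\mu_{1,n}} - \log\mu_{1,n} \right) \right| > t \right] \leq 2 f(\alpha_{t/2}, m) + f\left(\tfrac{t}{2L}, m\right) $$

with

$$ \alpha_t = \begin{cases} \dfrac{t\,\mu_{1,n}^2}{\mu_{1,n}(1+t) + \mu_{2,n}}, & \text{if } \mu_{1,n} + \mu_{2,n} \geq 0 \\[2ex] \dfrac{t\,\mu_{1,n}^2}{\mu_{1,n}(1-t) + \mu_{2,n}}, & \text{otherwise,} \end{cases} $$

where $L = \frac{1}{a \ln 2}$ and $a = \min\left\{ \frac{1}{m}\sum_{y=1}^{m} V_{x,y},\ \mu_{1,n} \right\}$.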


Proof. See Appendix D. □ 

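Proposition 5.6. A random channel $W$ of the form introduced in this section with $C_{\mathrm{LB}}^{(n,m)}(W)$
given in (6) satisfies

$$ \mathbb{P}\left[ \left| C_{\mathrm{LB}}^{(n,m)}(W) - \left( \frac{\mu_{2,n}}{\mu_{1,n}} - \log\mu_{1,n} \right) \right| > t \right] \leq 2 f(\alpha_{t/4}, m) + f\left(\tfrac{t}{4L}, m\right) + f(\beta_{t/(2L)}, n) + f(\beta_{t/(2L)}, m) $$

with

$$ \alpha_t = \begin{cases} \dfrac{t\,\mu_{1,n}^2}{\mu_{1,n}(1+t) + \mu_{2,n}}, & \text{if } \mu_{1,n} + \mu_{2,n} \geq 0 \\[2ex] \dfrac{t\,\mu_{1,n}^2}{\mu_{1,n}(1-t) + \mu_{2,n}}, & \text{otherwise} \end{cases}, \qquad \beta_t = \frac{t\,\mu_{1,n}}{2+t}, \qquad L = \frac{1}{a \ln 2} $$

and

$$ a = \min\left\{ \frac{1}{m}\sum_{y=1}^{m} V_{x,y},\ \sum_{k\in[n]} \frac{V_{k,y}}{\sum_{\ell\in[m]} V_{k,\ell}},\ \frac{n}{m} \right\}. $$

Proof. See Appendix D. □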


Proof of Theorem 5.2. Theorem 5.2 follows directly from Propositions 5.5 and 5.6 as by definition
$C_{\mathrm{LB}}^{(n,m)}(W) \leq C(W) \leq C_{\mathrm{UB}}^{(n,m)}(W)$ almost surely. □


6. CONCLUSION AND DISCUSSION 

In this article we studied the capacity of discrete memoryless channels whose channel transition 
matrix consists of entries that are nonnegative i.i.d. random variables V before being normalized. 
It was shown that under some mild assumptions on the distribution of the random variables, the 
capacity of such a channel as the dimension goes to infinity converges to the asymptotic capacity 
given by $\frac{\mu_2}{\mu_1} - \log\mu_1$ almost surely and in $L^2$, where $\mu_1 := \mathbb{E}[V]$ and $\mu_2 := \mathbb{E}[V \log V]$. Interestingly,
for some distributions, e.g., the uniform and exponential distribution, the asymptotic capacity is 
a constant. Furthermore, we have shown that the capacity of these random channels converges 
exponentially to its asymptotic value in probability. Finally, we provided an interpretation of the 
asymptotic capacity as an upper bound to the maximum expected information gain in the context 
of Bayesian optimal experiment design. 

For future work we aim to investigate if the asymptotic capacity of a random channel determined 
by Theorem 2.3 has an operational meaning in other scenarios, e.g., in the setup of fading channels 
or in Bayesian estimation. Furthermore, it would be interesting to study the variance of the 
capacity of such random channels and its decay rate. 


Appendix A: Measurability of the capacity 


We show that the capacity of such a (random) DMC as well as the optimal input
distribution are random variables.


Lemma A.1 (Measurability). For a channel constructed as explained above the mapping
$C : M_{n,m} \to \mathbb{R}_{\geq 0}$ given by $C(W^{(n,m)}) = \max_{p\in\Delta_n} I(p, W^{(n,m)})$ is measurable. Furthermore, the
(set-valued) mapping $p^\star : M_{n,m} \rightrightarrows \Delta_n$, $p^\star(W^{(n,m)}) = \arg\max_{p\in\Delta_n} I(p, W^{(n,m)})$, describing the optimal
input distribution, is measurable.


Proof. Note that we have

$$ C(W^{(n,m)}) = \max_{p} \left\{ I(p, W^{(n,m)}) + \delta_{\Delta_n}(p) \right\}, \quad \text{where } \delta_{\Delta_n}(p) = \begin{cases} 0, & \text{if } p \in \Delta_n \\ -\infty, & \text{otherwise.} \end{cases} $$

Since the mapping $p \mapsto I(p, W^{(n,m)})$ is concave and continuous for almost any $W^{(n,m)}$, $I$ is a normal
integrand [20, Proposition 14.39]. Then, as shown in [20, Example 14.32], $I(p, W^{(n,m)}) + \delta_{\Delta_n}(p)$
is a normal integrand and as such the measurability of the mappings $W^{(n,m)} \mapsto C(W^{(n,m)})$ and
$W^{(n,m)} \mapsto p^\star(W^{(n,m)})$ follows by [20, Theorem 14.37]; see [20, Definition 14.1] for a definition of
measurability of a set-valued mapping. □


Appendix B: Additional examples 

Example B.1 (Uniform distribution). Consider a DMC as defined above, where the elements $V_{x,y}$
are uniformly distributed with support $[0, A]$ for some $A > 0$. It can be seen by the homogeneity
property in Remark 2.6 that the asymptotic capacity cannot depend on $A$. More precisely, for
$n \to \infty$ the capacity converges to $1 - \frac{1}{2\ln 2}$ almost surely and in $L^2$, since for $V_{x,y} \sim \mathcal{U}([0,A])$, we
have $\mu_1 = \mathbb{E}[V_{x,y}] = \frac{A}{2}$ and $\mu_2 = \mathbb{E}[V_{x,y} \log V_{x,y}] = \frac{A}{2}\log A - \frac{A}{4\ln 2}$.

Example B.2 (Gamma distribution). Consider a DMC as defined in Section 2 using a gamma
distribution with shape and scale parameter $k > 0$ and $\theta > 0$. For $n \to \infty$ its capacity converges
to $\frac{\psi(k+1)}{\ln 2} - \log k$ almost surely and in $L^2$, where $\psi(\cdot)$ denotes the digamma function. This is
a direct consequence of Theorem 2.3, since for $V_{x,y} \sim \Gamma(k,\theta)$ we have $\mu_1 = \mathbb{E}[V_{x,y}] = k\theta$ and
$\mu_2 = \mathbb{E}[V_{x,y} \log V_{x,y}] = \frac{k\theta}{\ln 2}\left( \psi(k+1) + \ln\theta \right)$. We note that $aV_{x,y} \sim \Gamma(k, a\theta)$ for positive $a$, which by the
homogeneity property (cf. Remark 2.6) implies that the asymptotic capacity cannot depend on $\theta$.

Example B.3 (Chi-squared distribution). Consider a DMC as defined in Section 2 using a
chi-squared distribution with degrees of freedom $k \in \mathbb{N}$. For $n \to \infty$ its capacity converges to
$1 + \frac{1}{\ln 2}\psi\left(1 + \frac{k}{2}\right) - \log k$ almost surely and in $L^2$, where $\psi(\cdot)$ denotes the digamma function. This
is a direct consequence of Theorem 2.3, since for $V_{x,y} \sim \chi^2(k)$ we have $\mu_1 = \mathbb{E}[V_{x,y}] = k$ and
$\mu_2 = \mathbb{E}[V_{x,y} \log V_{x,y}] = k + \frac{k}{\ln 2}\psi\left(1 + \frac{k}{2}\right)$.

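Example B.4 (Beta distribution). Consider a DMC as defined in Section 2 using a beta
distribution with shape parameters $\alpha, \beta > 0$. For $n \to \infty$ its capacity converges to
$\frac{H_\alpha - H_{\alpha+\beta}}{\ln 2} - \log\frac{\alpha}{\alpha+\beta}$ almost surely and in $L^2$, where $H_n$ denotes the $n$-th harmonic number. This is a
direct consequence of Theorem 2.3, using that for $V_{x,y} \sim \mathrm{beta}(\alpha, \beta)$ we have $\mu_1 = \mathbb{E}[V_{x,y}] = \frac{\alpha}{\alpha+\beta}$ and
$\mu_2 = \mathbb{E}[V_{x,y} \log V_{x,y}] = \frac{\alpha}{(\alpha+\beta)\ln 2}\left( H_\alpha - H_{\alpha+\beta} \right)$.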
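
As a quick cross-check of the closed forms in this appendix, the sketch below evaluates $\mu_2/\mu_1 - \log\mu_1$ (base-2 logarithm) for integer parameters, where the digamma function reduces to harmonic numbers via $\psi(k+1) = H_k - \gamma_E$. Restricting to integers is a simplifying assumption made here to stay within the Python standard library, and all function names are hypothetical. Two consistency checks are available: a beta(1,1) variable is uniform on $[0,1]$, and a chi-squared variable with two degrees of freedom is a gamma variable with shape 1.

```python
import math

EULER_GAMMA = 0.5772156649015329  # Euler-Mascheroni constant
LN2 = math.log(2.0)


def harmonic(k: int) -> float:
    """k-th harmonic number H_k = 1 + 1/2 + ... + 1/k, with H_0 = 0."""
    return sum(1.0 / j for j in range(1, k + 1))


def digamma_int(k: int) -> float:
    """psi(k) for a positive integer k, using psi(k) = H_{k-1} - gamma_E."""
    if k < 1:
        raise ValueError("k must be a positive integer")
    return harmonic(k - 1) - EULER_GAMMA


def cap_uniform() -> float:
    """Uniform on [0, A]: 1 - 1/(2 ln 2), independent of A."""
    return 1.0 - 1.0 / (2.0 * LN2)


def cap_gamma(k: int) -> float:
    """Gamma(k, theta), integer shape k: psi(k+1)/ln 2 - log2(k), independent of theta."""
    return digamma_int(k + 1) / LN2 - math.log2(k)


def cap_chi2_even(k: int) -> float:
    """Chi-squared with even k: 1 + psi(1 + k/2)/ln 2 - log2(k)."""
    if k < 2 or k % 2:
        raise ValueError("this sketch only handles even degrees of freedom")
    return 1.0 + digamma_int(1 + k // 2) / LN2 - math.log2(k)


def cap_beta(a: int, b: int) -> float:
    """Beta(a, b), integer shapes: (H_a - H_{a+b})/ln 2 - log2(a/(a+b))."""
    return (harmonic(a) - harmonic(a + b)) / LN2 - math.log2(a / (a + b))


if __name__ == "__main__":
    print("uniform   ", round(cap_uniform(), 4))
    print("gamma k=1 ", round(cap_gamma(1), 4))
    print("chi2 k=2  ", round(cap_chi2_even(2), 4))
    print("beta(1,1) ", round(cap_beta(1, 1), 4))
```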


Appendix C: Two technical lemmas 


Lemma C.1. Let $X_1, X_2, \ldots, X_n$ be i.i.d. nonnegative random variables with $\mathbb{E}[X_i] =: \mu_1 > 0$,
$\mathbb{E}[X_i \log X_i] =: \mu_2$ and $\mathbb{E}[(X_i \log X_i)^2] < \infty$. Let $Y_i = \frac{X_i}{\sum_{j=1}^n X_j}$, then as $n \to \infty$,
$\sum_{i=1}^n Y_i \log Y_i + \log n \to \frac{\mu_2}{\mu_1} - \log\mu_1$ almost surely and in $L^2$.

Proof. Let $\xi_n := \frac{1}{n}\sum_{j=1}^n X_j$ and $Z_j := X_j \log X_j$. We then can write

$$ \sum_{i=1}^n Y_i \log Y_i + \log n = \sum_{i=1}^n \frac{X_i}{\sum_{j=1}^n X_j} \log\frac{X_i}{\sum_{j=1}^n X_j} + \log n = \frac{1}{\xi_n} \frac{1}{n}\sum_{i=1}^n Z_i - \log\xi_n. \tag{C1} $$

Note that $\mathbb{E}[(X_i \log X_i)^2] < \infty$ implies that $X_i$ has a finite second moment. Using the strong
law of large numbers [19, Theorem 2.4.1], it follows that for $n \to \infty$, $\xi_n \to \mu_1$ almost surely
and $\frac{1}{n}\sum_{i=1}^n Z_i \to \mu_2$ almost surely. The convergence in $L^2$ follows by using the $L^2$-weak law [19,
Theorem 2.2.3] instead of the strong law of large numbers. □
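
Lemma C.1 can be checked empirically. The sketch below draws i.i.d. Uniform[0,1] samples (an assumed choice, for which $\mu_1 = 1/2$ and $\mu_2 = -1/(4\ln 2)$), forms $Y_i = X_i / \sum_j X_j$, and compares $\sum_i Y_i \log Y_i + \log n$ with the limit $\mu_2/\mu_1 - \log\mu_1 = 1 - 1/(2\ln 2)$. Function names, seed and sample sizes are hypothetical.

```python
import math
import random


def normalized_entropy_stat(xs) -> float:
    """Return sum_i Y_i log2(Y_i) + log2(n) with Y_i = X_i / sum_j X_j."""
    total = sum(xs)
    n = len(xs)
    acc = 0.0
    for x in xs:
        if x > 0.0:  # convention 0 * log 0 = 0
            y = x / total
            acc += y * math.log2(y)
    return acc + math.log2(n)


def uniform_limit() -> float:
    """mu_2/mu_1 - log2(mu_1) for Uniform[0, 1], i.e. 1 - 1/(2 ln 2)."""
    return 1.0 - 1.0 / (2.0 * math.log(2.0))


if __name__ == "__main__":
    rng = random.Random(0)
    for n in (100, 10_000, 200_000):
        xs = [rng.random() for _ in range(n)]
        print(n, round(normalized_entropy_stat(xs), 4), round(uniform_limit(), 4))
```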


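Lemma C.2. Let $(X_{i,j})_{i\in[n],j\in[m]}$ be i.i.d. random variables taking values on $\mathbb{R}_{\geq 0}$, with
$\mathbb{E}[X_{i,j}] =: \mu > 0$ and $\mathbb{E}[X_{i,j}^2] < \infty$. If $Y_{i,j} = \frac{X_{i,j}}{\sum_{k=1}^m X_{i,k}}$, then for $n \to \infty$, where $m := \lceil \gamma n \rceil$ for some
$\gamma \in \mathbb{R}_{>0}$,

$$ \sum_{i=1}^n Y_{i,j} \to \frac{1}{\gamma} $$

almost surely and in $L^2$ for every $j \in [m]$.

Proof. By assumption we can write

$$ \sum_{i=1}^n Y_{i,j} = \sum_{i=1}^n \frac{X_{i,j}}{\sum_{k=1}^m X_{i,k}} = \frac{1}{\gamma n + \varepsilon_n} \sum_{i=1}^n \frac{X_{i,j}}{\frac{1}{m}\sum_{k=1}^m X_{i,k}}, \tag{C2} $$

where $\varepsilon_n := \lceil \gamma n \rceil - \gamma n \in [0,1)$ for all $n \in \mathbb{N}$. We can bound (C2) from above as

$$ \frac{1}{\gamma n + \varepsilon_n} \sum_{i=1}^n \frac{X_{i,j}}{\frac{1}{m}\sum_{k=1}^m X_{i,k}} \leq \frac{1}{\gamma n + \varepsilon_n} \sum_{i=1}^n \frac{X_{i,j}}{\frac{1}{m}\min_{\ell\in[n]} \sum_{k=1}^m X_{\ell,k}} = \frac{1}{\gamma} \cdot \frac{1}{1 + \frac{\varepsilon_n}{\gamma n}} \cdot \frac{\frac{1}{n}\sum_{i=1}^n X_{i,j}}{\frac{1}{m}\sum_{k=1}^m X_{\hat\imath,k}}, \tag{C3} $$

where $\hat\imath \in \arg\min_{\ell\in[n]} \sum_{k=1}^m X_{\ell,k}$. By the strong law of large numbers [19, Theorem 2.4.1], respectively
the $L^2$-weak law [19, Theorem 2.2.3], the right hand side of (C3) converges to $1/\gamma$ almost
surely, respectively in $L^2$, for $n \to \infty$. We can also bound (C2) from below by

$$ \frac{1}{\gamma n + \varepsilon_n} \sum_{i=1}^n \frac{X_{i,j}}{\frac{1}{m}\sum_{k=1}^m X_{i,k}} \geq \frac{1}{\gamma n + \varepsilon_n} \sum_{i=1}^n \frac{X_{i,j}}{\frac{1}{m}\max_{\ell\in[n]} \sum_{k=1}^m X_{\ell,k}} = \frac{1}{\gamma} \cdot \frac{1}{1 + \frac{\varepsilon_n}{\gamma n}} \cdot \frac{\frac{1}{n}\sum_{i=1}^n X_{i,j}}{\frac{1}{m}\sum_{k=1}^m X_{\check\imath,k}}, \tag{C4} $$

where $\check\imath \in \arg\max_{\ell\in[n]} \sum_{k=1}^m X_{\ell,k}$. Again the strong law of large numbers and the $L^2$-weak law
ensure that the right hand side of (C4) converges to $1/\gamma$ almost surely and in $L^2$ for $n \to \infty$.
This then implies that (C2) converges to $1/\gamma$ almost surely and in $L^2$ for $n \to \infty$, which proves the
assertion. Note that the law of large numbers applies in (C4) because the number of sequences
is linear in the sequence length. □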
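
A small simulation illustrates Lemma C.2: for an $n \times m$ matrix of i.i.d. entries with $m = \lceil \gamma n \rceil$, the sum over $i$ of the row-normalized entries in a fixed column $j$ should be close to $1/\gamma$. The sketch assumes Exp(1) entries and $\gamma = 2$ purely as an example; names and parameters are hypothetical.

```python
import math
import random


def column_sum_of_normalized_rows(n: int, gamma: float, j: int,
                                  rng: random.Random) -> float:
    """Return sum_i X[i][j] / sum_k X[i][k] for an n x m matrix with m = ceil(gamma * n)."""
    m = math.ceil(gamma * n)
    if not 0 <= j < m:
        raise ValueError("column index out of range")
    total = 0.0
    for _ in range(n):
        # Exp(1) is an assumed example law; any i.i.d. nonnegative law with
        # positive mean and finite second moment could be substituted.
        row = [rng.expovariate(1.0) for _ in range(m)]
        total += row[j] / sum(row)
    return total


if __name__ == "__main__":
    rng = random.Random(0)
    for n in (10, 100, 1000):
        value = column_sum_of_normalized_rows(n, 2.0, 0, rng)
        print(f"n={n:5d}  column sum={value:.4f}  target={1 / 2.0:.4f}")
```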

Appendix D: Proof of Theorem 5.2 

As already sketched in Section 5 A, the proof of Theorem 5.2 is implied by convergence rate 
statements for the upper and lower bounds of Section 2 (Propositions 5.5 and 5.6) respectively. 
This appendix provides a proof for the mentioned two propositions. To prove Proposition 5.5 we 
need a few preparatory lemmas. 

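Lemma D.1 (Bernstein's inequality [9, Corollary 2.11]). Let $X_1, \ldots, X_n$ be independent real-valued
random variables. Assume that there exist positive numbers $K$ and $\Gamma$ such that $\frac{1}{n}\sum_{i=1}^n \mathbb{E}[X_i^2] \leq K$
and

$$ \frac{1}{n}\sum_{i=1}^n \mathbb{E}[(X_i)_+^q] \leq \frac{q!}{2} K \Gamma^{q-2} \quad \text{for all integers } q \geq 3. $$

If $S = \frac{1}{n}\sum_{i=1}^n (X_i - \mathbb{E}[X_i])$, then for all $t > 0$

$$ \mathbb{P}[S \geq t] \leq \exp\left( \frac{-n t^2}{2(K + \Gamma t)} \right). $$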
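
Lemma D.1 can be related to the usual unnormalized statement of Bernstein's inequality by a rescaling. The following worked step is a sketch under the assumption that the moment conditions carry the factor $1/n$, so that the variance proxy is $v = nK$ and the scale parameter is $c = \Gamma$; the symbols $v$, $c$ and $S'$ are introduced here only for the comparison.

```latex
% Unnormalized form, with S' = \sum_{i=1}^n (X_i - E[X_i]):
%   \sum_i E[X_i^2] \le v  and  \sum_i E[(X_i)_+^q] \le \frac{q!}{2} v c^{q-2}
%   imply  P[S' \ge s] \le \exp(-s^2 / (2(v + c s))).
% Substituting v = nK, c = \Gamma, s = nt and S = S'/n:
\begin{align*}
\mathbb{P}[S \geq t] = \mathbb{P}[S' \geq nt]
  \leq \exp\!\left( \frac{-(nt)^2}{2(nK + \Gamma n t)} \right)
  = \exp\!\left( \frac{-n t^2}{2(K + \Gamma t)} \right)
  = f(t, n).
\end{align*}
```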
Lemma D.2. Let $X$ and $Y$ be two random variables, $\eta_1$ and $\eta_2$ be two real constants such that
$\mathbb{P}[|X - \eta_1| > t] \leq g_1(t)$ and $\mathbb{P}[|Y - \eta_2| > t] \leq g_2(t)$ for two functions $g_i : \mathbb{R} \to \mathbb{R}_{\geq 0}$, $i \in \{1,2\}$.
Then, $\mathbb{P}[|X + Y - \eta_1 - \eta_2| > t] \leq g_1(\tfrac{t}{2}) + g_2(\tfrac{t}{2})$.

Proof. Consider the following four events $A_t := \{|X - \eta_1| > t\}$, $B_t := \{|Y - \eta_2| > t\}$,
$C_t := \{|X + Y - \eta_1 - \eta_2| > t\}$ and $D_t := \{|X - \eta_1| + |Y - \eta_2| > t\}$. Using the triangle inequality and the
union bound we find

$$ \mathbb{P}[C_t] \leq \mathbb{P}[D_t] \leq \mathbb{P}[A_{t/2} \cup B_{t/2}] \leq \mathbb{P}[A_{t/2}] + \mathbb{P}[B_{t/2}] \leq g_1(\tfrac{t}{2}) + g_2(\tfrac{t}{2}). $$

□


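Lemma D.3. Let $X$, $Y$ be random variables and $\eta_1 \in \mathbb{R}$, $\eta_2 \in \mathbb{R}_{>0}$ constants such that
$\mathbb{P}[|X - \eta_1| > t] \leq g_1(t)$ and $\mathbb{P}[|Y - \eta_2| > t] \leq g_2(t)$, for two functions $g_i : \mathbb{R} \to \mathbb{R}_{\geq 0}$, $i \in \{1,2\}$
and $t \in \mathbb{R}_{\geq 0}$. Then

$$ \mathbb{P}\left[ \left| \frac{X}{Y} - \frac{\eta_1}{\eta_2} \right| > t \right] \leq g_1(\alpha_t) + g_2(\alpha_t) \quad \text{with} \quad \alpha_t = \begin{cases} \dfrac{t\,\eta_2^2}{\eta_2(1+t) + \eta_1}, & \text{if } \eta_1 + \eta_2 \geq 0 \\[2ex] \dfrac{t\,\eta_2^2}{\eta_2(1-t) + \eta_1}, & \text{otherwise.} \end{cases} $$

Proof. Consider the three events $A_t := \{|X - \eta_1| > t\}$, $B_t := \{|Y - \eta_2| > t\}$ and $C_t := \{|\frac{X}{Y} - \frac{\eta_1}{\eta_2}| > t\}$.
We first show that $A_t^c \cap B_t^c \subseteq C_{\gamma_t}^c$ with $\gamma_t := \frac{t(\eta_1+\eta_2)}{\eta_2(\eta_2 - t)}$ for $t \neq \eta_2$ and $\eta_1 + \eta_2 \geq 0$. Given $A_t^c \cap B_t^c$
it follows that $X \in [\eta_1 - t, \eta_1 + t]$ and $Y \in [\eta_2 - t, \eta_2 + t]$. Given $A_t^c \cap B_t^c$, two possible extreme
values of $|\frac{X}{Y} - \frac{\eta_1}{\eta_2}|$ are

$$ \frac{\eta_1 + t}{\eta_2 - t} - \frac{\eta_1}{\eta_2} = \frac{t(\eta_1 + \eta_2)}{\eta_2(\eta_2 - t)} =: \gamma_t \quad \text{for } t \neq \eta_2 \text{ and} \tag{D1} $$

$$ \frac{\eta_1}{\eta_2} - \frac{\eta_1 - t}{\eta_2 + t} = \frac{t(\eta_1 + \eta_2)}{\eta_2(\eta_2 + t)} =: \tilde\gamma_t, \tag{D2} $$

where it is immediate that for $\eta_1 + \eta_2 \geq 0$, we have $\gamma_t \geq \tilde\gamma_t$. We thus have $|\frac{X}{Y} - \frac{\eta_1}{\eta_2}| \leq \gamma_t$, which
implies by definition that $A_t^c \cap B_t^c \subseteq C_{\gamma_t}^c$ and thus

$$ \mathbb{P}[C_{\gamma_t}^c] \geq \mathbb{P}[A_t^c \cap B_t^c]. \tag{D3} $$

Using (D3), de Morgan's law and the union bound we find

$$ \mathbb{P}[C_{\gamma_t}] = 1 - \mathbb{P}[C_{\gamma_t}^c] \leq 1 - \mathbb{P}[A_t^c \cap B_t^c] = \mathbb{P}[A_t \cup B_t] \leq \mathbb{P}[A_t] + \mathbb{P}[B_t]. \tag{D4} $$

Solving $\gamma_t = \frac{t(\eta_1+\eta_2)}{\eta_2(\eta_2 - t)}$ for $t \neq \eta_2$ and inserting it into (D4) proves the assertion for $\eta_1 + \eta_2 \geq 0$. If
$\eta_1 + \eta_2 < 0$, we have $\tilde\gamma_t > \gamma_t$. Following the same lines as above, i.e., solving $\tilde\gamma_t = \frac{t(\eta_1+\eta_2)}{\eta_2(\eta_2 + t)}$ for $t$
and inserting it into (D4) proves the assertion for $\eta_1 + \eta_2 < 0$.

□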
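
The role of $\alpha_t$ as the inverse of the map $t \mapsto \gamma_t$ can be verified numerically. The sketch below implements (D1) and the resulting $\alpha_t$ for the case $\eta_1 + \eta_2 \geq 0$ and checks that $\gamma$ evaluated at $\alpha_t$ returns $t$. It also shows that for $\eta_1 = \eta_2 = \mu$ the expression reduces to $t\mu/(2+t)$, the form of $\beta_t$ used in Theorem 5.2. Function names and sample values are illustrative assumptions.

```python
def gamma_upper(s: float, eta1: float, eta2: float) -> float:
    """(eta1 + s)/(eta2 - s) - eta1/eta2 = s (eta1 + eta2) / (eta2 (eta2 - s)), cf. (D1)."""
    if s == eta2:
        raise ValueError("gamma is undefined for s == eta2")
    return s * (eta1 + eta2) / (eta2 * (eta2 - s))


def alpha(t: float, eta1: float, eta2: float) -> float:
    """Deviation s solving gamma_upper(s) = t; only the case eta1 + eta2 >= 0 is covered."""
    if eta2 <= 0 or eta1 + eta2 < 0:
        raise ValueError("this sketch assumes eta2 > 0 and eta1 + eta2 >= 0")
    return t * eta2 ** 2 / (eta2 * (1.0 + t) + eta1)


if __name__ == "__main__":
    t, eta1, eta2 = 0.2, 0.3, 1.0
    s = alpha(t, eta1, eta2)
    print("alpha_t =", s, " gamma(alpha_t) =", gamma_upper(s, eta1, eta2))
    mu = 2.0
    print("alpha_t with eta1 = eta2 =", alpha(0.5, mu, mu), " t*mu/(2+t) =", 0.5 * mu / 2.5)
```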


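Lemma D.4. Let $X$ be a random variable and $\eta$ be a constant such that $X, \eta \in [a, \infty)$ for $a > 0$
and $\mathbb{P}[|X - \eta| > t] \leq g(t)$ for some function $g : \mathbb{R}_{\geq 0} \to \mathbb{R}_{\geq 0}$. Then $\mathbb{P}[|\log X - \log\eta| > t] \leq g(\tfrac{t}{L})$.

Proof. The function $h : [a, \infty) \to \mathbb{R}$ for $a > 0$ that maps $x \mapsto \log x$ is known to be Lipschitz
continuous with Lipschitz constant $L = \frac{1}{a \ln 2}$. By definition of Lipschitz continuity we obtain

$$ \mathbb{P}[|\log X - \log\eta| > t] \leq \mathbb{P}[L\,|X - \eta| > t] = \mathbb{P}\left[ |X - \eta| > \tfrac{t}{L} \right] \leq g\left(\tfrac{t}{L}\right). $$

□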
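
The Lipschitz constant used in Lemma D.4 follows from the mean value theorem. A short derivation, assuming that $\log$ denotes the base-2 logarithm (an inference from the factor $\ln 2$ in $L$) and writing $\xi$ for the intermediate point, a symbol introduced here:

```latex
\begin{align*}
\left| \log_2 x - \log_2 y \right|
  = \frac{|x - y|}{\xi \ln 2}
  \leq \frac{|x - y|}{a \ln 2}
  \qquad \text{for some } \xi \text{ between } x \text{ and } y, \quad x, y \geq a > 0 .
\end{align*}
```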


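Proof of Proposition 5.5. Let $(V_{x,y})_{x\in[n],y\in[m]}$, $\mu_{1,n}$ and $\mu_{2,n}$ be as defined above and let
$Z_{x,y} := V_{x,y} \log V_{x,y}$. According to Assumption 5.1 and Bernstein's inequality (Lemma D.1),

$$ \mathbb{P}\left[ \left| \frac{1}{m}\sum_{y=1}^m V_{x,y} - \mu_{1,n} \right| > t \right] \leq \exp\left( \frac{-m t^2}{2(K + \Gamma t)} \right) = f(t,m) \tag{D5} $$

and

$$ \mathbb{P}\left[ \left| \frac{1}{m}\sum_{y=1}^m Z_{x,y} - \mu_{2,n} \right| > t \right] \leq \exp\left( \frac{-m t^2}{2(K + \Gamma t)} \right) = f(t,m). \tag{D6} $$

According to Lemma D.3, (D5) and (D6) imply that

$$ \mathbb{P}\left[ \left| \frac{\frac{1}{m}\sum_{y=1}^m Z_{x,y}}{\frac{1}{m}\sum_{y=1}^m V_{x,y}} - \frac{\mu_{2,n}}{\mu_{1,n}} \right| > t \right] \leq 2 f(\alpha_t, m) \quad \forall x \in [n], \tag{D7} $$

for

$$ \alpha_t = \begin{cases} \dfrac{t\,\mu_{1,n}^2}{\mu_{1,n}(1+t) + \mu_{2,n}}, & \text{if } \mu_{1,n} + \mu_{2,n} \geq 0 \\[2ex] \dfrac{t\,\mu_{1,n}^2}{\mu_{1,n}(1-t) + \mu_{2,n}}, & \text{otherwise.} \end{cases} $$

Lemma D.4 together with (D5) gives

$$ \mathbb{P}\left[ \left| \log\left( \frac{1}{m}\sum_{y=1}^m V_{x,y} \right) - \log\mu_{1,n} \right| > t \right] \leq f\left(\tfrac{t}{L}, m\right) \quad \forall x \in [n], \tag{D8} $$

with $L = \frac{1}{a \ln 2}$ and $a = \min\left\{ \frac{1}{m}\sum_{y=1}^m V_{x,y},\ \mu_{1,n} \right\}$. Finally, using the definition of $C_{\mathrm{UB}}^{(n,m)}(W)$ given
in (5) we find

\begin{align*}
&\mathbb{P}\left[ \left| C_{\mathrm{UB}}^{(n,m)}(W) - \left( \frac{\mu_{2,n}}{\mu_{1,n}} - \log\mu_{1,n} \right) \right| > t \right] \\
&\quad = \mathbb{P}\left[ \left| \max_{x\in[n]} \left\{ \sum_{y=1}^m W_{x,y} \log W_{x,y} \right\} + \log m - \left( \frac{\mu_{2,n}}{\mu_{1,n}} - \log\mu_{1,n} \right) \right| > t \right] \\
&\quad = \mathbb{P}\left[ \left| \max_{x\in[n]} \left\{ \sum_{y=1}^m \frac{V_{x,y}}{\sum_{\ell=1}^m V_{x,\ell}} \log\frac{V_{x,y}}{\sum_{\ell=1}^m V_{x,\ell}} \right\} + \log m - \left( \frac{\mu_{2,n}}{\mu_{1,n}} - \log\mu_{1,n} \right) \right| > t \right] \\
&\quad = \mathbb{P}\left[ \left| \sum_{y=1}^m \frac{V_{\hat{x},y}}{\sum_{\ell=1}^m V_{\hat{x},\ell}} \log\frac{V_{\hat{x},y}}{\sum_{\ell=1}^m V_{\hat{x},\ell}} + \log m - \left( \frac{\mu_{2,n}}{\mu_{1,n}} - \log\mu_{1,n} \right) \right| > t \right] \tag{D9} \\
&\quad = \mathbb{P}\left[ \left| \frac{\frac{1}{m}\sum_{y=1}^m Z_{\hat{x},y}}{\frac{1}{m}\sum_{y=1}^m V_{\hat{x},y}} - \frac{\mu_{2,n}}{\mu_{1,n}} - \log\left( \frac{1}{m}\sum_{y=1}^m V_{\hat{x},y} \right) + \log\mu_{1,n} \right| > t \right] \tag{D10} \\
&\quad \leq 2 f(\alpha_{t/2}, m) + f\left(\tfrac{t}{2L}, m\right), \tag{D11}
\end{align*}

where in (D9) $\hat{x}$ denotes the $x \in [n]$ that achieves the maximum. Equation (D10) follows by
recalling that $Z_{x,y} := V_{x,y} \log V_{x,y}$. The inequality finally uses (D7), (D8) and Lemma D.2. □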
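We next derive a few preparatory lemmas that are used to prove Proposition 5.6.

Lemma D.5. Using the notation introduced above, for all $y \in [m]$,

$$ \mathbb{P}\left[ \left| \sum_{k\in[n]} W_{k,y} - \frac{n}{m} \right| > t \right] \leq f(\beta_t, n) + f(\beta_t, m), \quad \text{with } \beta_t := \frac{t\,\mu_{1,n}}{2+t}. $$

Proof. Fix an arbitrary $y \in [m]$ and define

$$ U_y := \sum_{k\in[n]} W_{k,y} = \sum_{k\in[n]} \frac{V_{k,y}}{\sum_{\ell\in[m]} V_{k,\ell}}. $$

With $\hat{k} \in \arg\max_{k\in[n]} \sum_{\ell\in[m]} V_{k,\ell}$ and $\check{k} \in \arg\min_{k\in[n]} \sum_{\ell\in[m]} V_{k,\ell}$ we can bound $U_y$ from below
and above by

$$ U_{y,\mathrm{LB}} = \frac{n}{m} \cdot \frac{\frac{1}{n}\sum_{k\in[n]} V_{k,y}}{\frac{1}{m}\sum_{\ell\in[m]} V_{\hat{k},\ell}} \quad \text{and} \quad U_{y,\mathrm{UB}} = \frac{n}{m} \cdot \frac{\frac{1}{n}\sum_{k\in[n]} V_{k,y}}{\frac{1}{m}\sum_{\ell\in[m]} V_{\check{k},\ell}}. $$

Lemma D.1 and Lemma D.3 give

$$ \mathbb{P}\left[ \left| U_{y,\mathrm{LB}} - \frac{n}{m} \right| > t \right] = \mathbb{P}\left[ \left| \frac{n}{m} \cdot \frac{\frac{1}{n}\sum_{k\in[n]} V_{k,y}}{\frac{1}{m}\sum_{\ell\in[m]} V_{\hat{k},\ell}} - \frac{n}{m} \right| > t \right] \leq f(\beta_t, n) + f(\beta_t, m), $$

for $f(\cdot,\cdot)$ and $\beta_t$ as defined in (16) and the theorem. The same argument can be used to bound
$\mathbb{P}[|U_{y,\mathrm{UB}} - \frac{n}{m}| > t]$, which then proves the assertion. □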

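Lemma D.6. Using the notation introduced above, for every $y \in [m]$,

$$ \mathbb{P}\left[ \left| \log\sum_{k\in[n]} W_{k,y} - \log\frac{n}{m} \right| > t \right] \leq f(\beta_{t/L}, n) + f(\beta_{t/L}, m) $$

with $\beta_t := \frac{t\,\mu_{1,n}}{2+t}$, $L = \frac{1}{a \ln 2}$ and $a = \min\left\{ \sum_{k\in[n]} \frac{V_{k,y}}{\sum_{\ell\in[m]} V_{k,\ell}},\ \frac{n}{m} \right\}$.

Proof. Follows directly from Lemmas D.4 and D.5. □

Lemma D.7. Using the notation introduced above

$$ \mathbb{P}\left[ \left| \frac{1}{n}\sum_{x\in[n],y\in[m]} W_{x,y} \log\sum_{k\in[n]} W_{k,y} - \log\frac{n}{m} \right| > t \right] \leq f(\beta_{t/L}, n) + f(\beta_{t/L}, m) $$

with $\beta_t := \frac{t\,\mu_{1,n}}{2+t}$, $L = \frac{1}{a \ln 2}$ and $a = \min\left\{ \sum_{k\in[n]} \frac{V_{k,y}}{\sum_{\ell\in[m]} V_{k,\ell}},\ \frac{n}{m} \right\}$.

Proof. For an arbitrary $y \in [m]$, define the events

$$ A_t := \left\{ \left| \log\sum_{k\in[n]} W_{k,y} - \log\frac{n}{m} \right| > t \right\} \quad \text{and} \quad B_t := \left\{ \left| \frac{1}{n}\sum_{x\in[n],y\in[m]} W_{x,y} \log\sum_{k\in[n]} W_{k,y} - \log\frac{n}{m} \right| > t \right\}. $$

Suppose that for all $y \in [m]$, $\left| \log\sum_{k\in[n]} W_{k,y} - \log\frac{n}{m} \right| \leq t$; this implies that

$$ \left| \frac{1}{n}\sum_{x\in[n],y\in[m]} W_{x,y} \log\sum_{k\in[n]} W_{k,y} - \log\frac{n}{m} \right| \leq t, $$

where we used that $W_{x,y} \geq 0$ for all $x \in [n]$, $y \in [m]$ and that for all $x \in [n]$, $\sum_{y\in[m]} W_{x,y} = 1$.
Thus $\mathbb{P}[A_t^c] \leq \mathbb{P}[B_t^c]$ or equivalently $\mathbb{P}[B_t] \leq \mathbb{P}[A_t]$, which together with Lemma D.6 proves the
assertion. □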

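Proof of Proposition 5.6. Let $\hat{x} \in \arg\min_{x\in[n]} \sum_{y\in[m]} W_{x,y} \log W_{x,y}$. By definition of $C_{\mathrm{LB}}^{(n,m)}(W)$
given in (6) we have

\begin{align*}
&\mathbb{P}\left[ \left| C_{\mathrm{LB}}^{(n,m)}(W) - \left( \frac{\mu_{2,n}}{\mu_{1,n}} - \log\mu_{1,n} \right) \right| > t \right] \\
&\quad = \mathbb{P}\left[ \left| \log n + \frac{1}{n}\sum_{x\in[n],y\in[m]} W_{x,y} \log W_{x,y} - \left( \frac{\mu_{2,n}}{\mu_{1,n}} - \log\mu_{1,n} \right) - \frac{1}{n}\sum_{x\in[n],y\in[m]} W_{x,y} \log\sum_{k\in[n]} W_{k,y} \right| > t \right] \\
&\quad = \mathbb{P}\left[ \left| \log m + \frac{1}{n}\sum_{x\in[n],y\in[m]} W_{x,y} \log W_{x,y} - \left( \frac{\mu_{2,n}}{\mu_{1,n}} - \log\mu_{1,n} \right) - \frac{1}{n}\sum_{x\in[n],y\in[m]} W_{x,y} \log\sum_{k\in[n]} W_{k,y} + \log\frac{n}{m} \right| > t \right] \\
&\quad \leq \mathbb{P}\left[ \left| \log m + \sum_{y\in[m]} W_{\hat{x},y} \log W_{\hat{x},y} - \left( \frac{\mu_{2,n}}{\mu_{1,n}} - \log\mu_{1,n} \right) - \frac{1}{n}\sum_{x\in[n],y\in[m]} W_{x,y} \log\sum_{k\in[n]} W_{k,y} + \log\frac{n}{m} \right| > t \right] \\
&\quad \leq 2 f(\alpha_{t/4}, m) + f\left(\tfrac{t}{4L}, m\right) + f(\beta_{t/(2L)}, n) + f(\beta_{t/(2L)}, m),
\end{align*}

where the final inequality uses similar steps as done in the derivation of (D11) together with
Lemma D.2 and Lemma D.7. □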